INN Hotels
Objective
Aim of INN Hotels
Data Dictionary
The data contains the different attributes of customers' booking details. The detailed data dictionary is given below.
Booking_ID: the unique identifier of each booking
# Importing the necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
import scipy.stats as stats
from scipy.stats import ttest_ind
from statsmodels.stats.proportion import proportions_ztest
from scipy.stats import chi2_contingency
from scipy.stats import shapiro
from statsmodels.stats.weightstats import ztest as ztest
%matplotlib inline
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, classification_report, roc_auc_score, roc_curve, auc
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from xgboost import XGBClassifier
from sklearn.naive_bayes import GaussianNB
# mounting to drive
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
# reading the datafile
data = pd.read_csv('/content/drive/MyDrive/INNHotelsGroup.csv')
# making a copy of the dataset
df = data.copy()
⚛ Head of the data set
# check first five rows
df.head()
| | Booking_ID | no_of_adults | no_of_children | no_of_weekend_nights | no_of_week_nights | type_of_meal_plan | required_car_parking_space | room_type_reserved | lead_time | arrival_year | arrival_month | arrival_date | market_segment_type | repeated_guest | no_of_previous_cancellations | no_of_previous_bookings_not_canceled | avg_price_per_room | no_of_special_requests | booking_status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | INN00001 | 2 | 0 | 1 | 2 | Meal Plan 1 | 0 | Room_Type 1 | 224 | 2017 | 10 | 2 | Offline | 0 | 0 | 0 | 65.00 | 0 | Not_Canceled |
| 1 | INN00002 | 2 | 0 | 2 | 3 | Not Selected | 0 | Room_Type 1 | 5 | 2018 | 11 | 6 | Online | 0 | 0 | 0 | 106.68 | 1 | Not_Canceled |
| 2 | INN00003 | 1 | 0 | 2 | 1 | Meal Plan 1 | 0 | Room_Type 1 | 1 | 2018 | 2 | 28 | Online | 0 | 0 | 0 | 60.00 | 0 | Canceled |
| 3 | INN00004 | 2 | 0 | 0 | 2 | Meal Plan 1 | 0 | Room_Type 1 | 211 | 2018 | 5 | 20 | Online | 0 | 0 | 0 | 100.00 | 0 | Canceled |
| 4 | INN00005 | 2 | 0 | 1 | 1 | Not Selected | 0 | Room_Type 1 | 48 | 2018 | 4 | 11 | Online | 0 | 0 | 0 | 94.50 | 0 | Canceled |
⚛ Tail of the dataset
# check last five rows of the dataset
df.tail()
| | Booking_ID | no_of_adults | no_of_children | no_of_weekend_nights | no_of_week_nights | type_of_meal_plan | required_car_parking_space | room_type_reserved | lead_time | arrival_year | arrival_month | arrival_date | market_segment_type | repeated_guest | no_of_previous_cancellations | no_of_previous_bookings_not_canceled | avg_price_per_room | no_of_special_requests | booking_status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 36270 | INN36271 | 3 | 0 | 2 | 6 | Meal Plan 1 | 0 | Room_Type 4 | 85 | 2018 | 8 | 3 | Online | 0 | 0 | 0 | 167.80 | 1 | Not_Canceled |
| 36271 | INN36272 | 2 | 0 | 1 | 3 | Meal Plan 1 | 0 | Room_Type 1 | 228 | 2018 | 10 | 17 | Online | 0 | 0 | 0 | 90.95 | 2 | Canceled |
| 36272 | INN36273 | 2 | 0 | 2 | 6 | Meal Plan 1 | 0 | Room_Type 1 | 148 | 2018 | 7 | 1 | Online | 0 | 0 | 0 | 98.39 | 2 | Not_Canceled |
| 36273 | INN36274 | 2 | 0 | 0 | 3 | Not Selected | 0 | Room_Type 1 | 63 | 2018 | 4 | 21 | Online | 0 | 0 | 0 | 94.50 | 0 | Canceled |
| 36274 | INN36275 | 2 | 0 | 1 | 2 | Meal Plan 1 | 0 | Room_Type 1 | 207 | 2018 | 12 | 30 | Offline | 0 | 0 | 0 | 161.67 | 0 | Not_Canceled |
⚛ Shape of the dataset
# check no. of rows and columns
df.shape
(36275, 19)
⚛ Data Types present in dataset
# Analyzing the data types present in dataset
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36275 entries, 0 to 36274
Data columns (total 19 columns):
 #   Column                                Non-Null Count  Dtype
---  ------                                --------------  -----
 0   Booking_ID                            36275 non-null  object
 1   no_of_adults                          36275 non-null  int64
 2   no_of_children                        36275 non-null  int64
 3   no_of_weekend_nights                  36275 non-null  int64
 4   no_of_week_nights                     36275 non-null  int64
 5   type_of_meal_plan                     36275 non-null  object
 6   required_car_parking_space            36275 non-null  int64
 7   room_type_reserved                    36275 non-null  object
 8   lead_time                             36275 non-null  int64
 9   arrival_year                          36275 non-null  int64
 10  arrival_month                         36275 non-null  int64
 11  arrival_date                          36275 non-null  int64
 12  market_segment_type                   36275 non-null  object
 13  repeated_guest                        36275 non-null  int64
 14  no_of_previous_cancellations          36275 non-null  int64
 15  no_of_previous_bookings_not_canceled  36275 non-null  int64
 16  avg_price_per_room                    36275 non-null  float64
 17  no_of_special_requests                36275 non-null  int64
 18  booking_status                        36275 non-null  object
dtypes: float64(1), int64(13), object(5)
memory usage: 5.3+ MB
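The summary above shows that the arrival date is stored as three separate integer columns (`arrival_year`, `arrival_month`, `arrival_date`). As a minimal sketch, assuming a single timestamp is useful for later time-based analysis, they can be combined as follows. The helper name `build_arrival_timestamp` is hypothetical and not part of the original notebook. Invalid calendar combinations are coerced to `NaT` rather than raising an error.

```python
import pandas as pd


def build_arrival_timestamp(frame: pd.DataFrame) -> pd.Series:
    """Combine the three integer arrival columns into one datetime Series.

    Hypothetical helper: calendar combinations that do not exist
    (for example day 30 in February) become NaT instead of raising.
    """
    parts = pd.DataFrame(
        {
            "year": frame["arrival_year"],
            "month": frame["arrival_month"],
            "day": frame["arrival_date"],
        }
    )
    # pandas assembles datetimes from columns named year, month and day
    return pd.to_datetime(parts, errors="coerce")
```

On the notebook's frame this would be called as `build_arrival_timestamp(df)`. Counting the `NaT` values in the result shows how many rows carry an impossible date.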
⚛ Duplicates in dataset
# checking if there are duplicate entries in the dataset
df.duplicated().sum()
0
⚛ Missing values
# checking if there are missing values
df.isnull().sum()
| | 0 |
|---|---|
| Booking_ID | 0 |
| no_of_adults | 0 |
| no_of_children | 0 |
| no_of_weekend_nights | 0 |
| no_of_week_nights | 0 |
| type_of_meal_plan | 0 |
| required_car_parking_space | 0 |
| room_type_reserved | 0 |
| lead_time | 0 |
| arrival_year | 0 |
| arrival_month | 0 |
| arrival_date | 0 |
| market_segment_type | 0 |
| repeated_guest | 0 |
| no_of_previous_cancellations | 0 |
| no_of_previous_bookings_not_canceled | 0 |
| avg_price_per_room | 0 |
| no_of_special_requests | 0 |
| booking_status | 0 |
⚛ Descriptive statistical summary
# statistical summary
df.describe().T # Transpose for readability
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| no_of_adults | 36275.0 | 1.844962 | 0.518715 | 0.0 | 2.0 | 2.00 | 2.0 | 4.0 |
| no_of_children | 36275.0 | 0.105279 | 0.402648 | 0.0 | 0.0 | 0.00 | 0.0 | 10.0 |
| no_of_weekend_nights | 36275.0 | 0.810724 | 0.870644 | 0.0 | 0.0 | 1.00 | 2.0 | 7.0 |
| no_of_week_nights | 36275.0 | 2.204300 | 1.410905 | 0.0 | 1.0 | 2.00 | 3.0 | 17.0 |
| required_car_parking_space | 36275.0 | 0.030986 | 0.173281 | 0.0 | 0.0 | 0.00 | 0.0 | 1.0 |
| lead_time | 36275.0 | 85.232557 | 85.930817 | 0.0 | 17.0 | 57.00 | 126.0 | 443.0 |
| arrival_year | 36275.0 | 2017.820427 | 0.383836 | 2017.0 | 2018.0 | 2018.00 | 2018.0 | 2018.0 |
| arrival_month | 36275.0 | 7.423653 | 3.069894 | 1.0 | 5.0 | 8.00 | 10.0 | 12.0 |
| arrival_date | 36275.0 | 15.596995 | 8.740447 | 1.0 | 8.0 | 16.00 | 23.0 | 31.0 |
| repeated_guest | 36275.0 | 0.025637 | 0.158053 | 0.0 | 0.0 | 0.00 | 0.0 | 1.0 |
| no_of_previous_cancellations | 36275.0 | 0.023349 | 0.368331 | 0.0 | 0.0 | 0.00 | 0.0 | 13.0 |
| no_of_previous_bookings_not_canceled | 36275.0 | 0.153411 | 1.754171 | 0.0 | 0.0 | 0.00 | 0.0 | 58.0 |
| avg_price_per_room | 36275.0 | 103.423539 | 35.089424 | 0.0 | 80.3 | 99.45 | 120.0 | 540.0 |
| no_of_special_requests | 36275.0 | 0.619655 | 0.786236 | 0.0 | 0.0 | 0.00 | 1.0 | 5.0 |
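The summary reports a minimum `avg_price_per_room` of 0.0, which is worth inspecting before modelling. The sketch below counts zero-priced bookings per market segment. The helper name `zero_price_by_segment` is hypothetical. Whether such rows are complimentary stays or data-entry errors is an assumption to verify against the data, not something the summary itself establishes.

```python
import pandas as pd


def zero_price_by_segment(frame: pd.DataFrame) -> pd.Series:
    """Count bookings whose avg_price_per_room is exactly 0, per market segment.

    Hypothetical helper for a quick sanity check; it does not modify the frame.
    """
    zero_rows = frame[frame["avg_price_per_room"] == 0]
    return zero_rows["market_segment_type"].value_counts()
```

Calling `zero_price_by_segment(df)` returns one count per segment that has at least one zero-priced booking. Segments without any are absent from the result.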
⚛ Unique values
# checking unique values for each column
for column in df.columns:
    print("{} : {}".format(column, df[column].unique()))
    print("-"*90)
Booking_ID : ['INN00001' 'INN00002' 'INN00003' ... 'INN36273' 'INN36274' 'INN36275']
------------------------------------------------------------------------------------------
no_of_adults : [2 1 3 0 4]
------------------------------------------------------------------------------------------
no_of_children : [ 0 2 1 3 10 9]
------------------------------------------------------------------------------------------
no_of_weekend_nights : [1 2 0 4 3 6 5 7]
------------------------------------------------------------------------------------------
no_of_week_nights : [ 2 3 1 4 5 0 10 6 11 7 15 9 13 8 14 12 17 16]
------------------------------------------------------------------------------------------
type_of_meal_plan : ['Meal Plan 1' 'Not Selected' 'Meal Plan 2' 'Meal Plan 3']
------------------------------------------------------------------------------------------
required_car_parking_space : [0 1]
------------------------------------------------------------------------------------------
room_type_reserved : ['Room_Type 1' 'Room_Type 4' 'Room_Type 2' 'Room_Type 6' 'Room_Type 5' 'Room_Type 7' 'Room_Type 3']
------------------------------------------------------------------------------------------
lead_time : [224 5 1 211 48 346 34 83 121 44 0 35 30 95 47 256 99 12 122 2 37 130 60 56 3
 107 72 23 289 247 186 64 96 41 55 146 32 57 7 124 169 6 51 13 100 139 117 39 86 19
 192 179 26 74 143 177 18 267 155 46 128 20 40 196 188 17 110 68 73 92 171 134 320 118 189
 16 24 8 10 182 116 123 105 443 317 286 148 14 85 25 28 80 11 162 82 27 245 266 112 88
 69 273 4 97 31 62 197 280 185 160 104 22 292 109 126 303 81 54 15 161 147 87 127 418 156
 58 433 111 195 119 59 78 335 103 70 76 144 49 77 36 79 21 33 164 152 43 102 71 209 93
 53 302 239 45 167 113 84 9 166 174 61 151 52 67 282 38 175 89 133 65 66 50 159 386 115
 237 125 91 29 221 213 198 75 180 236 120 230 63 136 309 157 268 217 94 305 98 42 154 330 137
 184 232 304 114 257 265 191 101 259 149 170 271 207 108 210 222 296 194 145 153
 275 158 301 349 200 315 181 263 176 141 270 150 359 244 219 142 138 276 178 163 377 290 216 226 258
 254 193 131 208 215 190 381 231 248 106 308 140 173 168 172 90 249 205 129 212 135 220 277 253 132
 183 255 223 336 288 229 319 199 203 228 246 235 294 281 202 361 287 291 313 206 269 279 261 214 274
 250 187 240 241 323 322 227 225 233 338 283 327 204 352 165 251 299 314 285 238 328 278 332 243 201
 307 272 252 242 284 297 324 260 262 326 295 218 234 353 300 355 306 298 331 341 318 333 372 311 310
 345 264 325 293 348 350 351]
------------------------------------------------------------------------------------------
arrival_year : [2017 2018]
------------------------------------------------------------------------------------------
arrival_month : [10 11 2 5 4 9 12 7 6 8 3 1]
------------------------------------------------------------------------------------------
arrival_date : [ 2 6 28 20 11 13 15 26 18 30 5 10 4 25 22 21 19 17 7 9 27 1 29 16 3 24 14 31 23 8 12]
------------------------------------------------------------------------------------------
market_segment_type : ['Offline' 'Online' 'Corporate' 'Aviation' 'Complementary']
------------------------------------------------------------------------------------------
repeated_guest : [0 1]
------------------------------------------------------------------------------------------
no_of_previous_cancellations : [ 0 3 1 2 11 4 5 13 6]
------------------------------------------------------------------------------------------
no_of_previous_bookings_not_canceled : [ 0 5 1 3 4 12 19 2 15 17 7 20 16 50 13 6 14 34 18 8 10 23 11 49 47
 53 9 33 22 24 52 21 48 28 39 25 31 38 26 51 42 37 35 56 44 27 32 55 45 30
 57 46 54 43 58 41 29 40 36]
------------------------------------------------------------------------------------------
avg_price_per_room : [ 65. 106.68 60. ... 118.43 137.25 167.8 ]
------------------------------------------------------------------------------------------
no_of_special_requests : [0 1 3 2 4 5]
------------------------------------------------------------------------------------------
booking_status : ['Not_Canceled' 'Canceled']
------------------------------------------------------------------------------------------
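The unique values include 0 for `no_of_adults`, 9 and 10 for `no_of_children`, and 0 for both night counts. As a rough sketch, the helper below counts how many bookings fall into each of these unusual cases. The name `guest_count_checks` and the threshold of 9 children are illustrative assumptions. Whether these rows are errors or legitimate bookings is a judgment call the notebook does not settle.

```python
import pandas as pd


def guest_count_checks(frame: pd.DataFrame) -> dict:
    """Count bookings with unusual guest or night values.

    Hypothetical helper; the thresholds are illustrative, not from the notebook.
    """
    total_nights = frame["no_of_weekend_nights"] + frame["no_of_week_nights"]
    return {
        # bookings that list no adult at all
        "no_adults": int((frame["no_of_adults"] == 0).sum()),
        # bookings with 9 or more children
        "many_children": int((frame["no_of_children"] >= 9).sum()),
        # bookings with zero nights in total
        "no_nights": int((total_nights == 0).sum()),
    }
```

Running `guest_count_checks(df)` gives three counts that can guide a decision on whether to keep, correct, or drop such rows.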
# visualizing the number of adults using bar plot
plt.figure(figsize=(7,4)) # size of the figure
sns.countplot(x='no_of_adults', data=df, palette="pastel")
plt.title('Number of adults') # title of the plot
plt.xlabel('Number of adults') # label of x-axis
plt.ylabel('Count') # label of y-axis
plt.show() # show the plot
Insight
# visualizing no of children using a bar plot
plt.figure(figsize=(7,5)) # size of the figure
sns.countplot(x='no_of_children', data=df, palette="pastel")
plt.title('Number of children') # title of the plot
plt.xlabel('Number of children') # label of x-axis
plt.ylabel('Count') # label of y-axis
plt.show() # show the plot
Insight
# visualizing using a histogram for no of weekend nights
plt.figure(figsize=(7,4))
sns.histplot(df['no_of_weekend_nights'],kde=True , bins = 10, color="#adf7b6")
plt.title('Distribution of weekend nights (kernel density estimate)') # title of the plot
plt.xlabel('No of weekend nights [Saturday or Sunday]') # label of x-axis
plt.ylabel('count') # label of y-axis
plt.show() # show the plot
plt.figure(figsize=(7,4))
sns.histplot(df['no_of_weekend_nights'], bins = 10, color="#a0ced9")
plt.title('Distribution of weekend nights') # title of the plot
plt.xlabel('No of weekend nights [Saturday or Sunday]') # label of x-axis
plt.ylabel('count') # label of y-axis
plt.show() # show the plot
Insight
#visualizing no of week nights stay by customer using histogram
plt.figure(figsize=(7,4)) # size of the figure
sns.histplot(df['no_of_week_nights'], bins=10,kde=True, color="#adf7b6")
plt.title('Distribution of week nights (kernel density estimate)') # title of the plot
plt.xlabel('No of week nights [Monday to Friday]') # label of x-axis
plt.ylabel('count') # label of y-axis
plt.show() # show the plot
plt.figure(figsize=(7,4)) # size of the figure
sns.histplot(df['no_of_week_nights'], bins=10, color="#a0ced9")
plt.title('Distribution of week nights') # title of the plot
plt.xlabel('No of week nights [Monday to Friday]') # label of x-axis
plt.ylabel('count') # label of y-axis
plt.show() # show the plot
Insight
# visualizing the type of meal plan selected by the customers using countplot
plt.figure(figsize=(7,4)) # size of the figure
sns.countplot(x='type_of_meal_plan', data=df, palette="pastel")
plt.title('Type of meal plan') # title of the plot
plt.xlabel('Type of meal plan') # label of x-axis
plt.ylabel('Count') # label of y-axis
plt.show() # show the plot
Insight
Type of meal plan booked by the customer:
The plot above shows that most customers choose Meal Plan 1, which includes only breakfast.
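To put a number on "most customers", the counts behind the plot can be expressed as percentages. This is a minimal sketch. The helper name `share_percent` is hypothetical and works on any categorical column.

```python
import pandas as pd


def share_percent(series: pd.Series) -> pd.Series:
    """Return the percentage share of each category, largest first.

    Hypothetical helper: value_counts(normalize=True) gives fractions,
    which are scaled to percent and rounded to one decimal place.
    """
    return series.value_counts(normalize=True).mul(100).round(1)
```

For the meal plans this would be `share_percent(df['type_of_meal_plan'])`. The same call works for `room_type_reserved` or `market_segment_type`.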
# visualizing the required car parking space using a pie chart
plt.figure(figsize=(6,6))
df.required_car_parking_space.value_counts().plot(kind='pie', autopct='%1.0f%%', colors = ['#a0ced9', '#adf7b6'])
plt.title('Required car parking space') # title of the plot
plt.show()
Insight
# visualizing the room type reserved by the customers using countplot
plt.figure(figsize=(8,5)) # size of the figure
sns.countplot(x='room_type_reserved', data=df, palette="pastel")
plt.title('Room type reserved') # title of the plot
plt.xticks(rotation=90) # rotate the x-axis labels
plt.xlabel('Room type reserved') # label of x-axis
plt.ylabel('Count') # label of y-axis
plt.show() # show the plot
Insight
# visualizing the lead time using a box plot
plt.figure(figsize=(7,4))
sns.boxplot(y='lead_time', data=df, color = '#a0ced9')
plt.title('Lead time') # title of the plot
plt.ylabel('Lead time (days)') # label of y-axis
plt.show()
Insight
# visualization of arrival year using bar plot
plt.figure(figsize=(7,4)) # size of the figure
sns.countplot(x='arrival_year', data=df, palette="pastel")
plt.title('Arrival year') # title of the plot
plt.xlabel('Year') # label of x-axis
plt.ylabel('Count') # label of y-axis
plt.show() # show the plot
Insight
# visualizing the arrival month using a line plot
plt.figure(figsize=(7,5))
month_counts = df['arrival_month'].value_counts().sort_index() # bookings per month, ordered 1 to 12
sns.lineplot(x=month_counts.index, y=month_counts.values, color="#ff7477")
plt.title('Arrival month')
plt.xlabel('Month')
plt.ylabel('Count')
plt.show()
Insight
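# visualizing the day of the month on which customers arrive using a line plot
plt.figure(figsize=(15,4)) # size of the figure
date_counts = df['arrival_date'].value_counts().sort_index() # bookings per day of month, ordered 1 to 31
sns.lineplot(x=date_counts.index, y=date_counts.values, color="#ff7477")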
plt.title('Arrival date') # title of the plot
plt.xlabel('Date') # label of x-axis
plt.ylabel('Count') # label of y-axis
plt.show() # show the plot
Insight
# visualizing market segment type using a count plot
plt.figure(figsize=(7,4))
sns.countplot(x='market_segment_type', data=df, palette="pastel")
plt.title('Market segment type')
plt.xlabel('Market segment type')
plt.ylabel('Count')
plt.show()
Insight
# visualizing repeated guests using a pie chart
plt.figure(figsize=(6,6)) # size of the figure
df.repeated_guest.value_counts().plot(kind='pie', colors = ['#a0ced9', '#adf7b6'])
plt.title('Repeated guest') # title of the plot
plt.show()
Insight
Is the customer a repeated guest? (0 = No, 1 = Yes)
The chart above shows that repeated guests make up only a very small share of INN Hotels bookings.
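The pie chart carries no percentage labels, so the size of the repeated-guest group is easier to read from a small table. The descriptive summary earlier reports a mean of 0.025637 for `repeated_guest`, which for a 0/1 flag means roughly 2.6% of bookings. The sketch below reproduces that figure alongside the raw counts. The helper name `repeated_guest_breakdown` is hypothetical.

```python
import pandas as pd


def repeated_guest_breakdown(frame: pd.DataFrame) -> pd.DataFrame:
    """Counts and percentages of bookings by the 0/1 repeated_guest flag.

    Hypothetical helper: the flag is mapped to readable labels first.
    """
    labels = frame["repeated_guest"].map({0: "No", 1: "Yes"})
    counts = labels.value_counts()
    percent = (counts / counts.sum() * 100).round(2)
    return pd.DataFrame({"bookings": counts, "percent": percent})
```

On the notebook's frame the call is `repeated_guest_breakdown(df)`.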
# Visualizing the cancellations made previously using a histogram
plt.figure(figsize=(7,4)) # size of the figure
sns.histplot(df['no_of_previous_cancellations'], bins=10 , color="#a0ced9")
plt.title('No of previous cancellations') # title of the plot
plt.xlabel('No of previous cancellations') # label of x-axis
plt.ylabel('count') # label of y-axis
plt.show()
Insight
# visualization of previous bookings not cancelled
plt.figure(figsize=(7,4)) # size of the figure
sns.histplot(df['no_of_previous_bookings_not_canceled'], bins=10 , color="#a0ced9")
plt.title('No of previous bookings not cancelled') # title of the plot
plt.xlabel('No of previous bookings not cancelled') # label of x-axis
plt.ylabel('count') # label of y-axis
plt.show()
Insight
# visualizing the average price per room using a histogram
plt.figure(figsize=(7,4)) # size of the figure
sns.histplot(df['avg_price_per_room'], bins=10 , color="#a0ced9")
plt.title('Average price per room') # title of the plot
plt.xlabel('Average price per room') # label of x-axis
plt.ylabel('count') # label of y-axis
plt.show()
# visualizing avg price using boxplot
plt.figure(figsize=(7,4)) # size of the figure
sns.boxplot(y='avg_price_per_room', data=df, color = '#a0ced9')
plt.title('Average price per room') # title of the plot
plt.ylabel('Average price per room') # label of y-axis
plt.show()
Insight
# visualizing the no of special requests made by customers using bar plot
plt.figure(figsize=(7,4)) # size of the figure
sns.countplot(x='no_of_special_requests', data=df, palette="pastel")
plt.title('No of special requests') # title of the plot
plt.xlabel('No of special requests') # label of x-axis
plt.ylabel('Count') # label of y-axis
plt.show()
Insight
# visualizing the booking status using a countplot
plt.figure(figsize=(7,4)) # size of the figure
sns.countplot(x='booking_status', data=df, palette="pastel")
plt.title('Booking status') # title of the plot
plt.xlabel('Booking status') # label of x-axis
plt.ylabel('Count') # label of y-axis
plt.show() # show the plot
Insight
Target :
# boxplot to visualize the distribution of average room prices for different numbers of adults.
plt.figure(figsize=(8, 6)) # size of the figure
sns.boxplot(x='no_of_adults', y='avg_price_per_room', data=df, palette='pastel')
plt.title('Average Room Price vs Number of Adults') # title of the plot
plt.xlabel('Number of Adults') # label of x-axis
plt.ylabel('Average Room Price') # label of y-axis
plt.show() # show the plot
Insight
Target :
#boxplot to visualize the distribution of average room prices for different numbers of children.
plt.figure(figsize=(8, 6)) # size of the figure
sns.boxplot(x='no_of_children', y='avg_price_per_room', data=df, palette='pastel')
plt.title('Average Room Price vs Number of Children') # title of the plot
plt.xlabel('Number of Children') # label of x-axis
plt.ylabel('Average Room Price') # label of y-axis
plt.show()
Insight
Target :
# Analyzing the relation between stay on weekend night and room price with boxplot
plt.figure(figsize=(10, 6)) # size of the figure
sns.boxplot(x='no_of_weekend_nights', y='avg_price_per_room', data=df, palette='pastel')
plt.title('Average Room Price vs Number of Weekend Nights') # title of the plot
plt.xlabel('Number of Weekend Nights') # label of x-axis
plt.ylabel('Average Room Price') # label of y-axis
plt.show()
Insight
Target :
# Analyzing the effect of lead time over the average price per room
plt.figure(figsize=(8, 6)) # size of the figure
sns.scatterplot(x='lead_time', y='avg_price_per_room', data=df, color = '#a0ced9')
plt.title('Lead Time vs Average Room Price') # title of the plot
plt.xlabel('Lead Time (Days)') # label of x-axis
plt.ylabel('Average Room Price') # label of y-axis
plt.grid(True) # show the grid
plt.show()
Insight
Target :
# Analyzing the relation between stay on week night and room price with boxplot
plt.figure(figsize=(10, 6))
sns.boxplot(x='no_of_week_nights', y='avg_price_per_room', data=df, palette='pastel')
plt.title('Average Room Price vs Number of Week Nights')
plt.xlabel('Number of Week Nights')
plt.ylabel('Average Room Price')
plt.show()
Insight
Target:
# Analyze if bookings made in advance tend to have more special requests
plt.figure(figsize=(8, 6)) # size of the figure
sns.boxplot(x='no_of_special_requests', y='lead_time', data=df, palette='pastel')
plt.title('Lead Time vs Number of Special Requests') # title of the plot
plt.xlabel('Number of Special Requests') # label of x-axis
plt.ylabel('Lead Time (Days)') # label of y-axis
plt.show() # show the plot
Insight
Target :
# box plot analysis between the number of special requests and room prices
plt.figure(figsize=(8, 6)) # size of the figure
sns.boxplot(x='no_of_special_requests', y='avg_price_per_room', data=df, palette='pastel')
plt.title('Average Room Price vs Number of Special Requests') # title of the plot
plt.xlabel('Number of Special Requests') # label of x-axis
plt.ylabel('Average Room Price') # label of y-axis
plt.show()
Insight
Target :
# Analyzing whether there is any seasonality
plt.figure(figsize=(12, 6)) # size of the figure
sns.boxplot(x='arrival_month', y='avg_price_per_room', data=df, palette='pastel')
plt.title('Average Room Price vs Arrival Month') # title of the plot
plt.xlabel('Arrival Month') # label of x-axis
plt.ylabel('Average Room Price') # label of y-axis
plt.show() # show the plot
Insight
Target :
Determine if the specific date within a month affects room price
# scatterplot of room price against arrival date
plt.figure(figsize=(15, 6)) # size of the figure
sns.scatterplot(x='arrival_date', y='avg_price_per_room', data=df, color='#a0ced9')
plt.title('Average Room Price vs Arrival Date') # title of the plot
plt.xlabel('Arrival Date') # label of x-axis
plt.ylabel('Average Room Price') # label of y-axis
plt.grid(True) # show the grid
plt.show()
Insight
Target :
# Analyzing if repeated guests tend to pay more or less per room
plt.figure(figsize=(8, 5)) # size of the figure
sns.boxplot(x='repeated_guest', y='avg_price_per_room', data=df, palette='pastel')
plt.title('Average Room Price vs Repeated Guest') # title of the plot
plt.xlabel('Repeated Guest (0 = No, 1 = Yes)') # label of x-axis
plt.ylabel('Average Room Price') # label of y-axis
plt.show()
Insight
Target :
# Analyze the relation between lead time and previous cancellations
plt.figure(figsize=(8, 6)) # size of the figure
sns.scatterplot(x='no_of_previous_cancellations', y='lead_time', data=df, color='#a0ced9')
plt.title('Number of Previous Cancellations vs Lead Time') # title of the plot
plt.xlabel('Number of Previous Cancellations') # label of x-axis
plt.ylabel('Lead Time (Days)') # label of y-axis
plt.grid(True) # show the grid
plt.show()
Insight
Target :
# Analyzing the relation between number of previous bookings not cancelled and room price with boxplot
plt.figure(figsize=(20,9)) # size of the figure
sns.boxplot(x='no_of_previous_bookings_not_canceled', y='avg_price_per_room', data=df, palette='pastel')
plt.title('Average Room Price vs Number of Previous Bookings Not Cancelled') # title of the plot
plt.xlabel('Number of Previous Bookings Not Cancelled') # label of x-axis
plt.xticks(rotation=90) # rotate the x-axis labels for better visibility
plt.ylabel('Average Room Price') # label of y-axis
plt.show() # show the plot
Insight
Target :
# Analyzing the relation between market segmentation and price of the room
plt.figure(figsize=(10, 6)) # size of the figure
sns.boxplot(x='market_segment_type', y='avg_price_per_room', data=df, palette='pastel')
plt.title('Average Room Price vs Market Segment Type') # title of the plot
plt.xlabel('Market Segment Type') # label of x-axis
plt.ylabel('Average Room Price') # label of y-axis
plt.show() # show the plot
Insight
Target :
# Analyzing the frequency of booking with market segment
plt.figure(figsize=(10,5 )) # size of the figure
sns.countplot(x='market_segment_type', hue='booking_status', data=df, palette='pastel')
plt.title('Booking Status by Market Segment') # title of the plot
plt.xlabel('Market Segment Type') # label of x-axis
plt.ylabel('Count') # label of y-axis
plt.show() # show the plot
Insight
Target :
# Analyzing the booking status across no. of special requests made.
plt.figure(figsize=(8, 5)) # size of the figure
sns.boxplot(x='booking_status', y='no_of_special_requests', data=df, palette='pastel')
plt.title('Number of Special Requests vs Booking Status') # title of the plot
plt.xlabel('Booking Status') # label of x-axis
plt.ylabel('Number of Special Requests') # label of y-axis
plt.show() # show the plot
Insight
Target :
# Analyzing how special requests influence the price charged per room
plt.figure(figsize=(8, 6)) # size of the figure
sns.boxplot(x='no_of_special_requests', y='avg_price_per_room', data=df, palette='pastel')
plt.title('Average Room Price vs Number of Special Requests') # title of the plot
plt.xlabel('Number of Special Requests') # label of x-axis
plt.ylabel('Average Room Price') # label of y-axis
plt.show() # show the plot
Insight
Target :
# Analyzing the effect of booking status on price charged
plt.figure(figsize=(8, 5)) # size of the figure
sns.boxplot(x='booking_status', y='avg_price_per_room', data=df, palette='pastel')
plt.title('Average Room Price vs Booking Status') # title of the plot
plt.xlabel('Booking Status') # label of x-axis
plt.ylabel('Average Room Price') # label of y-axis
plt.show() # show the plot
Insight
Target :
# Analyzing the booking status and lead time using a boxplot
plt.figure(figsize=(8,4 )) # size of the figure
sns.boxplot(x='booking_status', y='lead_time', data=df, palette='pastel')
plt.title('Lead Time vs Booking Status') # title of the plot
plt.xlabel('Booking Status') # label of x-axis
plt.ylabel('Lead Time (Days)') # label of y-axis
plt.show() # show the plot
Insight
Target :
# Analyzing using countplot for Arrival month and Booking status
plt.figure(figsize=(12, 6)) # size of the figure
sns.countplot(x='arrival_month', hue='booking_status', data=df, palette='pastel')
plt.title('Booking Status by Arrival Month') # title of the plot
plt.xlabel('Arrival Month') # label of x-axis
plt.ylabel('Count') # label of y-axis
plt.show() # show the plot
Insight
Target :
# Analyzing booking status and repeated guest
plt.figure(figsize=(8, 5)) # size of the figure
sns.countplot(x='repeated_guest', hue='booking_status', data=df, palette='pastel')
plt.title('Booking Status by Repeated Guest') # title of the plot
plt.xlabel('Repeated Guest (0 = No, 1 = Yes)') # label of x-axis
plt.ylabel('Count') # label of y-axis
plt.show() # show the plot
Insight
Target :
# Analyzing the effect of arrival month on the price charged (boxplot)
plt.figure(figsize=(12, 6)) # size of the figure
sns.boxplot(x='arrival_month', y='avg_price_per_room', data=df, palette='pastel')
plt.title('Average Room Price vs Arrival Month') # title of the plot
plt.xlabel('Arrival Month') # label of x-axis
plt.ylabel('Average Room Price') # label of y-axis
plt.show()
# Analyzing the effect of arrival month on the price charged (bar plot of the mean price)
plt.figure(figsize=(12, 6)) # size of the figure
sns.barplot(x='arrival_month', y='avg_price_per_room', data=df, palette='pastel')
plt.title('Average Room Price vs Arrival Month') # title of the plot
plt.xlabel('Arrival Month') # label of x-axis
plt.ylabel('Average Room Price') # label of y-axis
plt.show() # show the plot
Insight
Target :
# clustered bar plot of Avg Price Per Room vs Market Segment Type vs Booking Status
plt.figure(figsize=(12, 6)) # size of the figure
sns.barplot(x='market_segment_type', y='avg_price_per_room', hue='booking_status', data=df, palette='pastel')
plt.title('Average Room Price vs Market Segment Type vs Booking Status') # title of the plot
plt.xlabel('Market Segment Type') # label of x-axis
plt.ylabel('Average Room Price') # label of y-axis
plt.xticks(rotation=90) # rotate the x-axis labels for better visibility
plt.show() # show the plot
Insights
Target :
# Analyzing using line plot for Average Price Per Room vs Arrival Month vs Booking Status
plt.figure(figsize=(12, 6)) # size of the figure
sns.lineplot(x='arrival_month', y='avg_price_per_room', hue='booking_status', data=df, marker='o')
plt.title('Average Room Price vs Arrival Month vs Booking Status') # title of the plot
plt.xlabel('Arrival Month') # label of x-axis
plt.ylabel('Average Room Price') # label of y-axis
plt.xticks(rotation=90) # rotate the x-axis labels for better visibility
plt.show() # show the plot
Insights
Some months, such as March, show higher cancellation rates at both high and low price points, which may reflect market conditions rather than price alone.
Certain months, such as June to September, show stable average prices with low cancellation rates, while other months show more variable pricing and more cancellations.
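The line plot compares average prices only, so the cancellation rates and price variability mentioned in these insights are easier to verify with a direct aggregation. The sketch below computes both per arrival month. The helper name `monthly_cancellation_summary` and its output column names are hypothetical. It assumes `booking_status` takes the values `Canceled` and `Not_Canceled`, as listed in the unique-value output.

```python
import pandas as pd


def monthly_cancellation_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Per arrival month: booking count, cancellation rate, mean and std of price.

    Hypothetical helper; cancel_rate is the fraction of rows marked 'Canceled'.
    """
    canceled = frame["booking_status"].eq("Canceled")
    return (
        frame.assign(canceled=canceled)
        .groupby("arrival_month")
        .agg(
            bookings=("canceled", "size"),
            cancel_rate=("canceled", "mean"),
            mean_price=("avg_price_per_room", "mean"),
            price_std=("avg_price_per_room", "std"),
        )
    )
```

Sorting the result of `monthly_cancellation_summary(df)` by `cancel_rate` or `price_std` shows directly which months stand out, instead of reading it off the plot.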
Target :
# visualizing the relation with scatterplot of Average Price Per Room vs Lead Time vs Booking Status
plt.figure(figsize=(12, 6)) # size of the figure
sns.scatterplot(x='lead_time', y='avg_price_per_room', hue='booking_status', data=df)
plt.title('Average Room Price vs Lead Time vs Booking Status') # title of the plot
plt.xlabel('Lead Time') # label of x-axis
plt.ylabel('Average Room Price') # label of y-axis
plt.show() # show the plot
Insight
Target :
# visualizing with clustered bar plot of No. of Adults vs Avg Price Per Room vs Booking Status
plt.figure(figsize=(12, 6)) # size of the figure
sns.barplot(x='no_of_adults', y='avg_price_per_room', hue='booking_status', data=df, palette='pastel')
plt.title('Average Room Price vs Number of Adults vs Booking Status') # title of the plot
plt.xlabel('Number of Adults') # label of x-axis
plt.ylabel('Average Room Price') # label of y-axis
plt.show() # show the plot
Insight
Target :
# boxplot of Avg Price Per Room vs Room Type Reserved vs Booking Status
plt.figure(figsize=(12,7 )) # size of the figure
sns.boxplot(y='room_type_reserved', x='avg_price_per_room', hue='booking_status', data=df, palette='pastel')
plt.title('Average Room Price vs Room Type Reserved vs Booking Status') # title of the plot
plt.xlabel('Average Room Price') # label of x-axis
plt.ylabel('Room Type Reserved') # label of y-axis
plt.xticks(rotation=90) # rotate the x-axis labels for better visibility
plt.show() # show the plot
Insight
Target :
# visualizing with grouped bar plot of No. of Special Requests vs Avg Price Per Room vs Booking Status
plt.figure(figsize=(12, 6)) # size of the figure
sns.barplot(x='no_of_special_requests', y='avg_price_per_room', hue='booking_status', data=df, palette='pastel')
plt.title('Average Room Price vs Number of Special Requests vs Booking Status') # title of the plot
plt.xlabel('Number of Special Requests') # label of x-axis
plt.ylabel('Average Room Price') # label of y-axis
plt.show() # show the plot
Insights
Target :
# Analyzing relationship with heatmap of Arrival Year vs Arrival Month vs Avg Price Per Room
heatmap_data = df.pivot_table(values='avg_price_per_room', index='arrival_year', columns='arrival_month', aggfunc='mean')
plt.figure(figsize=(12, 8)) # size of the figure
sns.heatmap(heatmap_data, annot=True, fmt=".2f", cmap="viridis")
plt.title('Average Room Price by Arrival Year and Month') # title of the plot
plt.xlabel('Arrival Month') # label of x-axis
plt.ylabel('Arrival Year') # label of y-axis
plt.show() # show the plot
Insights
Target :
# Visualizing with scatterplot of Lead Time vs No. of Weekend Nights vs Avg Price Per Room
plt.figure(figsize=(12,6)) # size of the figure
sns.scatterplot(x='lead_time', y='no_of_weekend_nights', hue='avg_price_per_room', data=df)
plt.title('Lead Time vs No. of Weekend Nights vs Avg Price Per Room') # title of the plot
plt.xlabel('Lead Time') # label of x-axis
plt.ylabel('No. of Weekend Nights') # label of y-axis
plt.show() # show the plot
Insights
Target :
# Visualizing with box plot of Lead Time vs Market Segment Type vs Booking Status
plt.figure(figsize=(12, 8)) # size of the figure
sns.boxplot(x='market_segment_type', y='lead_time', hue='booking_status', data=df, palette='pastel')
plt.title('Lead Time vs Market Segment Type vs Booking Status') # title of the plot
plt.xlabel('Market Segment Type') # label of x-axis
plt.ylabel('Lead Time') # label of y-axis
plt.xticks(rotation=90) # rotate the x-axis labels for better visibility
plt.show() # show the plot
Insights
Target :
# Visualizing with bar plot of No. of Children vs No. of Special Requests vs Booking Status
plt.figure(figsize=(10, 6)) # size of the figure
sns.barplot(x='no_of_children', y='no_of_special_requests', hue='booking_status', data=df, palette='pastel')
plt.title('No. of Children vs No. of Special Requests vs Booking Status') # title of the plot
plt.xlabel('No. of Children') # label of x-axis
plt.ylabel('Average No. of Special Requests') # label of y-axis
plt.show() # show the plot
Insights
Target :
# Analyzing with scatterplot of No. of Weekend Nights vs No. of Week Nights vs Avg Price Per Room
plt.figure(figsize=(12, 6)) # size of the figure
sns.scatterplot(x='no_of_weekend_nights', y='no_of_week_nights', hue='avg_price_per_room', data=df)
plt.title('No. of Weekend Nights vs No. of Week Nights vs Avg Price Per Room') # title of the plot
plt.xlabel('No. of Weekend Nights') # label of x-axis
plt.ylabel('No. of Week Nights') # label of y-axis
plt.show() # show the plot
# Analyzing relation with heatmap of No. of Weekend Nights vs No. of Week Nights vs Avg Price Per Room
heatmap_data = df.pivot_table(values='avg_price_per_room', index='no_of_weekend_nights', columns='no_of_week_nights', aggfunc='mean')
plt.figure(figsize=(12, 8)) # size of the figure
sns.heatmap(heatmap_data, annot=True, fmt=".2f", cmap="viridis")
plt.title('Average Room Price by Weekend Nights and Week Nights') # title of the plot
plt.xlabel('Number of Week Nights') # label of x-axis
plt.ylabel('Number of Weekend Nights') # label of y-axis
plt.show() # show the plot
Insights
Target :
# Visualizing with scatterplot of No. of Previous Cancellations vs Lead Time vs Booking Status
plt.figure(figsize=(12, 6)) # size of the figure
sns.scatterplot(x='no_of_previous_cancellations', y='lead_time', hue='booking_status', data=df)
plt.title('No. of Previous Cancellations vs Lead Time vs Booking Status') # title of the plot
plt.xlabel('No. of Previous Cancellations') # label of x-axis
plt.ylabel('Lead Time') # label of y-axis
plt.show() # show the plot
# Visualizing with barplot of No. of Previous Cancellations vs Lead Time vs Booking Status
plt.figure(figsize=(12, 6)) # size of the figure
sns.barplot(x='no_of_previous_cancellations', y='lead_time', hue='booking_status', data=df, palette='pastel')
plt.title('No. of Previous Cancellations vs Lead Time vs Booking Status') # title of the plot
plt.xlabel('No. of Previous Cancellations') # label of x-axis
plt.ylabel('Lead Time') # label of y-axis
plt.show() # show the plot
Insights
Target :
# Analyzing relationship with heatmap of Market Segment Type vs Room Type Reserved vs Avg Price Per Room
heatmap_data = df.pivot_table(values='avg_price_per_room', index='market_segment_type', columns='room_type_reserved', aggfunc='mean')
plt.figure(figsize=(12, 8)) # size of the figure
sns.heatmap(heatmap_data, annot=True, fmt=".2f", cmap="viridis")
plt.title('Average Room Price by Market Segment Type and Room Type Reserved') # title of the plot
plt.xlabel('Room Type Reserved') # label of x-axis
plt.xticks(rotation=90) # rotate the x-axis labels for better visibility
plt.ylabel('Market Segment Type') # label of y-axis
plt.show() # show the plot
Insights
Target :
# Visualize using a barplot of Arrival Month vs No. of Special Requests vs Booking Status
plt.figure(figsize=(12, 6)) # size of the figure
sns.barplot(x='arrival_month', y='no_of_special_requests', hue='booking_status', data=df, palette='pastel')
plt.title('Arrival Month vs No. of Special Requests vs Booking Status') # title of the plot
plt.xlabel('Arrival Month') # label of x-axis
plt.ylabel('No. of Special Requests') # label of y-axis
plt.xticks(rotation=90) # rotate the x-axis labels for better visibility
plt.show() # show the plot
Insights
Target :
# Analyzing relationship with barplot of Arrival Month vs Lead Time vs Booking Status
plt.figure(figsize=(12, 6)) # size of the figure
sns.barplot(x='arrival_month', y='lead_time', hue='booking_status', data=df, palette='pastel')
plt.title('Arrival Month vs Lead Time vs Booking Status') # title of the plot
plt.xlabel('Arrival Month') # label of x-axis
plt.ylabel('Lead Time') # label of y-axis
plt.xticks(rotation=90) # rotate the x-axis labels for better visibility
plt.show() # show the plot
Insights
Target :
# analyzing with count plot of Arrival Month vs Market Segment Type
plt.figure(figsize=(12, 6)) # size of the figure
sns.countplot(x='arrival_month', hue='market_segment_type', data=df, palette='pastel')
plt.title('Number of Bookings by Arrival Month and Market Segment Type') # title of the plot
plt.xlabel('Arrival Month') # label of x-axis
plt.ylabel('Count') # label of y-axis
plt.xticks(rotation=90) # rotate the x-axis labels for better visibility
plt.show() # show the plot
Insights
Target :
# analyzing with scatterplot of No. of Adults vs Lead Time vs Avg Price Per Room
plt.figure(figsize=(12, 6)) # size of the figure
sns.scatterplot(x='no_of_adults', y='lead_time', hue='avg_price_per_room', data=df)
plt.title('No. of Adults vs Lead Time vs Avg Price Per Room') # title of the plot
plt.xlabel('No. of Adults') # label of x-axis
plt.ylabel('Lead Time') # label of y-axis
plt.show() # show the plot
Insights
-- What are the busiest months in the hotel?
# To find the busiest months, let's calculate the number of bookings per month.
busiest_months = df.groupby('arrival_month')['Booking_ID'].count().reset_index()
busiest_months.columns = ['Month', 'No_of_Bookings']
# Sort by number of bookings in descending order to identify the busiest months
busiest_months = busiest_months.sort_values(by='No_of_Bookings', ascending=False)
busiest_months
| | Month | No_of_Bookings |
|---|---|---|
| 9 | 10 | 5317 |
| 8 | 9 | 4611 |
| 7 | 8 | 3813 |
| 5 | 6 | 3203 |
| 11 | 12 | 3021 |
| 10 | 11 | 2980 |
| 6 | 7 | 2920 |
| 3 | 4 | 2736 |
| 4 | 5 | 2598 |
| 2 | 3 | 2358 |
| 1 | 2 | 1704 |
| 0 | 1 | 1014 |
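The raw counts above can also be expressed as shares of all bookings, which makes months easier to compare. The following is a minimal sketch on a small made-up frame; `toy_bookings` and its values are hypothetical and only stand in for `df['arrival_month']`.

```python
import pandas as pd

# Hypothetical toy data standing in for df['arrival_month'] (not the real dataset)
toy_bookings = pd.DataFrame({"arrival_month": [10, 10, 10, 9, 9, 8, 1]})

# Percentage of bookings per month, largest share first
monthly_share = (
    toy_bookings["arrival_month"]
    .value_counts(normalize=True)  # fractions instead of counts
    .mul(100)
    .round(2)
)
print(monthly_share)
```

On the real data the same chain could be applied to `df['arrival_month']` directly.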
ANSWER
-- Which market segment do most of the guests come from?
# To determine which market segment most guests come from, let's calculate the count of bookings per market segment.
market_segment_counts = data['market_segment_type'].value_counts()
market_segment_counts
| market_segment_type | count |
|---|---|
| Online | 23214 |
| Offline | 10528 |
| Corporate | 2017 |
| Complementary | 391 |
| Aviation | 125 |
ANSWER
-- Hotel rates are dynamic and change according to demand and customer demographics. What are the differences in room prices in different market segments?
# Group by market segment and calculate the mean room price
room_price_by_segment = data.groupby('market_segment_type')['avg_price_per_room'].mean().reset_index()
print(room_price_by_segment)
  market_segment_type  avg_price_per_room
0            Aviation          100.704000
1       Complementary            3.141765
2           Corporate           82.911740
3             Offline           91.632679
4              Online          112.256855
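A mean alone can hide how many bookings sit behind each segment and how skewed the prices are. As a hedged sketch, the summary below adds the count and median next to the mean; the frame `toy_prices` and its numbers are invented for illustration.

```python
import pandas as pd

# Hypothetical toy data (not the INN Hotels dataset)
toy_prices = pd.DataFrame({
    "market_segment_type": ["Online", "Online", "Offline", "Offline", "Complementary"],
    "avg_price_per_room": [120.0, 100.0, 90.0, 94.0, 0.0],
})

# Count, mean and median price per segment, most expensive segment first
price_summary = (
    toy_prices.groupby("market_segment_type")["avg_price_per_room"]
    .agg(["count", "mean", "median"])
    .sort_values("mean", ascending=False)
)
print(price_summary)
```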
ANSWER
-- What percentage of bookings are canceled?
# Total number of bookings
total_bookings = len(df)
# Number of canceled bookings
canceled_bookings = len(df[df['booking_status'] == 'Canceled'])
# Calculate the percentage of canceled bookings
percentage_canceled = (canceled_bookings / total_bookings) * 100
print("Percentage of canceled bookings:", percentage_canceled)
Percentage of canceled bookings: 32.76361130254997
Answer
-- Repeating guests are the guests who stay in the hotel often and are important to brand equity. What percentage of repeating guests cancel?
# Filter for repeating guests
repeating_guests = df[df['repeated_guest'] == 1]
# Total number of repeating guests
total_repeating_guests = len(repeating_guests)
# Number of repeating guests who canceled
repeating_guests_canceled = len(repeating_guests[repeating_guests['booking_status'] == 'Canceled'])
# Calculate the percentage of repeating guests who canceled
percentage_repeating_guests_canceled = (repeating_guests_canceled / total_repeating_guests) * 100
print(f"Percentage of repeating guests who canceled: {percentage_repeating_guests_canceled:.2f}%")
Percentage of repeating guests who canceled: 1.72%
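The same kind of rate can be read off for both guest groups at once with a row-normalized cross-tabulation. This is a small sketch on made-up data; `toy_guests` is hypothetical and only mirrors the column names used above.

```python
import pandas as pd

# Hypothetical toy data using the notebook's column names
toy_guests = pd.DataFrame({
    "repeated_guest": [0, 0, 0, 0, 1, 1, 1, 1],
    "booking_status": ["Canceled", "Canceled", "Not_Canceled", "Not_Canceled",
                       "Not_Canceled", "Not_Canceled", "Not_Canceled", "Canceled"],
})

# normalize="index" makes every row sum to 1, so each cell is a within-group rate
rates = pd.crosstab(
    toy_guests["repeated_guest"], toy_guests["booking_status"], normalize="index"
) * 100
print(rates)
```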
Answer
-- Many guests have special requirements when booking a hotel room. Do these requirements affect booking cancellation?
# Guests with special requests
guests_with_requests = df[df['no_of_special_requests'] > 0]
# Guests without special requests
guests_without_requests = df[df['no_of_special_requests'] == 0]
# Calculate the cancellation rate for both groups
cancellation_rate_with_requests = (len(guests_with_requests[guests_with_requests['booking_status'] == 'Canceled'])
/ len(guests_with_requests)) * 100
cancellation_rate_without_requests = (len(guests_without_requests[guests_without_requests['booking_status'] == 'Canceled'])
/ len(guests_without_requests)) * 100
print(f"Cancellation rate for guests with special requests: {cancellation_rate_with_requests:.2f}%")
print(f"Cancellation rate for guests without special requests: {cancellation_rate_without_requests:.2f}%")
Cancellation rate for guests with special requests: 20.24%
Cancellation rate for guests without special requests: 43.21%
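The gap between the two rates suggests an association, and a chi-square test of independence is one way to check whether such a gap could plausibly be chance. The sketch below uses `chi2_contingency` (already imported at the top of the notebook) on invented counts; the table values are assumptions, not the real contingency table.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical contingency table: rows = request group, columns = booking status
toy_table = pd.DataFrame(
    {"Canceled": [40, 20], "Not_Canceled": [60, 80]},
    index=["no_requests", "has_requests"],
)

# Null hypothesis: booking status is independent of having special requests
chi2, p_value, dof, expected = chi2_contingency(toy_table)
print(f"chi2 = {chi2:.3f}, p-value = {p_value:.4f}, dof = {dof}")
```

With the real data, the table could be built with `pd.crosstab(df['no_of_special_requests'] > 0, df['booking_status'])`.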
Answer
⚛ Booking Cancellations
⚛ Special Requests
⚛ Pricing
⚛ Lead Time and Booking Status
⚛ Market Segmentation
⚛ Cancellation Policies
⚛ Improve customer Experience with Special Requests
⚛ Pricing Strategies
⚛ Targeted Marketing
⚛ Peak Seasons Management
data= df.copy()
data.loc[data['booking_status']=='Not_Canceled','booking_status'] = False
data.loc[data['booking_status']=='Canceled','booking_status'] = True
numeric_columns = data.select_dtypes(include=np.number).columns.tolist()
# drop arrival_year because it is a time variable and not helpful here
numeric_columns.remove("arrival_year")
plt.figure(figsize=(15, 12))
# draw one boxplot per numeric column
for i, variable in enumerate(numeric_columns):
    plt.subplot(4, 4, i + 1)
    plt.boxplot(data[variable], whis=1.5)
    plt.title(variable)
plt.tight_layout()
plt.show()
# Computing the IQR for avg_price_per_room
quartiles = np.quantile(data['avg_price_per_room'][data['avg_price_per_room'].notnull()], [.25, .75])
price_4iqr = 4 * (quartiles[1] - quartiles[0])
print(f'Q1 = {quartiles[0]}, Q3 = {quartiles[1]}, 4*IQR = {price_4iqr}')
# flag prices that lie more than 4*IQR away from the median
outlier_prices = data.loc[np.abs(data['avg_price_per_room'] - data['avg_price_per_room'].median()) > price_4iqr, 'avg_price_per_room']
outlier_prices.shape
Q1 = 80.3, Q3 = 120.0, 4*IQR = 158.8
(49,)
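The rule above measures distance from the median against 4*IQR, which is a fairly lenient cut-off. For comparison, here is a minimal sketch of the more common Tukey fences (1.5*IQR beyond the quartiles); the `toy_prices` array is hypothetical and the 1.5 multiplier is a conventional choice, not something taken from this notebook.

```python
import numpy as np

# Hypothetical room prices (not the real column)
toy_prices = np.array([60.0, 80.0, 90.0, 100.0, 110.0, 120.0, 540.0])

q1, q3 = np.quantile(toy_prices, [0.25, 0.75])
iqr = q3 - q1

# Tukey fences: anything outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] is flagged
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
toy_outliers = toy_prices[(toy_prices < lower) | (toy_prices > upper)]
print(f"fences = ({lower}, {upper}), outliers = {toy_outliers}")
```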
# creating a list of the numeric columns
dist_cols = data.select_dtypes(include=np.number).columns.tolist()
plt.figure(figsize=(15, 45))
# looping over the list and plotting a histogram for each column
for i in range(len(dist_cols)):
    plt.subplot(12, 3, i + 1)
    plt.hist(data[dist_cols[i]], bins=50)
    plt.title(dist_cols[i], fontsize=15)
plt.tight_layout()
plt.show()
data2 = data.copy()
# Print dist_cols to see its contents
print("Contents of dist_cols:", dist_cols)
# List of columns to be removed
columns_to_remove = [
'no_of_week_nights', 'no_of_adults', 'length_stay', 'avg_price_per_room',
'arrival_year', 'required_car_parking_space', 'arrival_date',
'arrival_month', 'repeated_guest', 'lead_time'
]
# Safely remove columns if they exist in dist_cols
for col in columns_to_remove:
    if col in dist_cols:
        dist_cols.remove(col)
# Print updated dist_cols to confirm removal
print("Updated dist_cols:", dist_cols)
Contents of dist_cols: ['no_of_children', 'no_of_weekend_nights', 'no_of_previous_cancellations', 'no_of_previous_bookings_not_canceled', 'no_of_special_requests']
Updated dist_cols: ['no_of_children', 'no_of_weekend_nights', 'no_of_previous_cancellations', 'no_of_previous_bookings_not_canceled', 'no_of_special_requests']
# applying a log(1 + x) transform to the remaining skewed columns
for col in dist_cols:
    data2[col + "_log"] = np.log1p(data2[col])
# dropping the original columns
data2.drop(dist_cols, axis=1, inplace=True)
data2.head()
| | Booking_ID | no_of_adults | no_of_week_nights | type_of_meal_plan | required_car_parking_space | room_type_reserved | lead_time | arrival_year | arrival_month | arrival_date | market_segment_type | repeated_guest | avg_price_per_room | booking_status | no_of_children_log | no_of_weekend_nights_log | no_of_previous_cancellations_log | no_of_previous_bookings_not_canceled_log | no_of_special_requests_log |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | INN00001 | 2 | 2 | Meal Plan 1 | 0 | Room_Type 1 | 224 | 2017 | 10 | 2 | Offline | 0 | 65.00 | False | 0.0 | 0.693147 | 0.0 | 0.0 | 0.000000 |
| 1 | INN00002 | 2 | 3 | Not Selected | 0 | Room_Type 1 | 5 | 2018 | 11 | 6 | Online | 0 | 106.68 | False | 0.0 | 1.098612 | 0.0 | 0.0 | 0.693147 |
| 2 | INN00003 | 1 | 1 | Meal Plan 1 | 0 | Room_Type 1 | 1 | 2018 | 2 | 28 | Online | 0 | 60.00 | True | 0.0 | 1.098612 | 0.0 | 0.0 | 0.000000 |
| 3 | INN00004 | 2 | 2 | Meal Plan 1 | 0 | Room_Type 1 | 211 | 2018 | 5 | 20 | Online | 0 | 100.00 | True | 0.0 | 0.000000 | 0.0 | 0.0 | 0.000000 |
| 4 | INN00005 | 2 | 1 | Not Selected | 0 | Room_Type 1 | 48 | 2018 | 4 | 11 | Online | 0 | 94.50 | True | 0.0 | 0.693147 | 0.0 | 0.0 | 0.000000 |
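The `_log` columns above hold log(1 + x) of the original counts, which is why a count of 1 shows up as 0.693147 and a count of 2 as 1.098612. A short sketch of the transform and its inverse follows; the `toy_counts` array is made up for illustration.

```python
import numpy as np

# Hypothetical count values such as a number of special requests
toy_counts = np.array([0, 1, 2, 10])

# log1p(x) equals log(1 + x) and keeps zero counts at exactly 0
logged = np.log1p(toy_counts)

# expm1 reverses the transform, which helps when interpreting model inputs
restored = np.expm1(logged)
print(logged.round(6), restored)
```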
# Drop booking id
data2 = data2.drop('Booking_ID', axis=1)
data2.head()
| | no_of_adults | no_of_week_nights | type_of_meal_plan | required_car_parking_space | room_type_reserved | lead_time | arrival_year | arrival_month | arrival_date | market_segment_type | repeated_guest | avg_price_per_room | booking_status | no_of_children_log | no_of_weekend_nights_log | no_of_previous_cancellations_log | no_of_previous_bookings_not_canceled_log | no_of_special_requests_log |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 2 | Meal Plan 1 | 0 | Room_Type 1 | 224 | 2017 | 10 | 2 | Offline | 0 | 65.00 | False | 0.0 | 0.693147 | 0.0 | 0.0 | 0.000000 |
| 1 | 2 | 3 | Not Selected | 0 | Room_Type 1 | 5 | 2018 | 11 | 6 | Online | 0 | 106.68 | False | 0.0 | 1.098612 | 0.0 | 0.0 | 0.693147 |
| 2 | 1 | 1 | Meal Plan 1 | 0 | Room_Type 1 | 1 | 2018 | 2 | 28 | Online | 0 | 60.00 | True | 0.0 | 1.098612 | 0.0 | 0.0 | 0.000000 |
| 3 | 2 | 2 | Meal Plan 1 | 0 | Room_Type 1 | 211 | 2018 | 5 | 20 | Online | 0 | 100.00 | True | 0.0 | 0.000000 | 0.0 | 0.0 | 0.000000 |
| 4 | 2 | 1 | Not Selected | 0 | Room_Type 1 | 48 | 2018 | 4 | 11 | Online | 0 | 94.50 | True | 0.0 | 0.693147 | 0.0 | 0.0 | 0.000000 |
# viewing the distributions after the log transformation
dist_cols = data2.select_dtypes(include=np.number).columns.tolist()
# plot a histogram (with KDE) of every numeric column
plt.figure(figsize=(15, 45))
for i in range(len(dist_cols)):
    plt.subplot(12, 3, i + 1)
    sns.histplot(data=data2, x=dist_cols[i], bins=50, kde=True)
    plt.title(dist_cols[i], fontsize=25)
plt.tight_layout()
plt.show()
# One-hot encoding the categorical variables
dummy_data = pd.get_dummies(
data2,
columns = [
'type_of_meal_plan',
'room_type_reserved',
'market_segment_type',
],
drop_first=True,
)
dummy_data.head()
| | no_of_adults | no_of_week_nights | required_car_parking_space | lead_time | arrival_year | arrival_month | arrival_date | repeated_guest | avg_price_per_room | booking_status | ... | room_type_reserved_Room_Type 2 | room_type_reserved_Room_Type 3 | room_type_reserved_Room_Type 4 | room_type_reserved_Room_Type 5 | room_type_reserved_Room_Type 6 | room_type_reserved_Room_Type 7 | market_segment_type_Complementary | market_segment_type_Corporate | market_segment_type_Offline | market_segment_type_Online |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 2 | 0 | 224 | 2017 | 10 | 2 | 0 | 65.00 | False | ... | False | False | False | False | False | False | False | False | True | False |
| 1 | 2 | 3 | 0 | 5 | 2018 | 11 | 6 | 0 | 106.68 | False | ... | False | False | False | False | False | False | False | False | False | True |
| 2 | 1 | 1 | 0 | 1 | 2018 | 2 | 28 | 0 | 60.00 | True | ... | False | False | False | False | False | False | False | False | False | True |
| 3 | 2 | 2 | 0 | 211 | 2018 | 5 | 20 | 0 | 100.00 | True | ... | False | False | False | False | False | False | False | False | False | True |
| 4 | 2 | 1 | 0 | 48 | 2018 | 4 | 11 | 0 | 94.50 | True | ... | False | False | False | False | False | False | False | False | False | True |
5 rows × 28 columns
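With `drop_first=True`, the first category of each variable (in sorted order) gets no column of its own and becomes the reference level, which is why there is no `market_segment_type_Aviation` column above. A minimal sketch on a hypothetical three-category frame:

```python
import pandas as pd

# Hypothetical toy data with three market segments
toy_segments = pd.DataFrame(
    {"market_segment_type": ["Aviation", "Online", "Offline", "Online"]}
)

# drop_first=True removes the first sorted category ("Aviation"),
# so an Aviation row is encoded as all zeros / all False
encoded = pd.get_dummies(toy_segments, columns=["market_segment_type"], drop_first=True)
print(encoded)
```

Dropping one level avoids perfectly collinear dummy columns once an intercept is in the model.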
dummy_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36275 entries, 0 to 36274
Data columns (total 28 columns):
 #   Column                                    Non-Null Count  Dtype
---  ------                                    --------------  -----
 0   no_of_adults                              36275 non-null  int64
 1   no_of_week_nights                         36275 non-null  int64
 2   required_car_parking_space                36275 non-null  int64
 3   lead_time                                 36275 non-null  int64
 4   arrival_year                              36275 non-null  int64
 5   arrival_month                             36275 non-null  int64
 6   arrival_date                              36275 non-null  int64
 7   repeated_guest                            36275 non-null  int64
 8   avg_price_per_room                        36275 non-null  float64
 9   booking_status                            36275 non-null  object
 10  no_of_children_log                        36275 non-null  float64
 11  no_of_weekend_nights_log                  36275 non-null  float64
 12  no_of_previous_cancellations_log          36275 non-null  float64
 13  no_of_previous_bookings_not_canceled_log  36275 non-null  float64
 14  no_of_special_requests_log                36275 non-null  float64
 15  type_of_meal_plan_Meal Plan 2             36275 non-null  bool
 16  type_of_meal_plan_Meal Plan 3             36275 non-null  bool
 17  type_of_meal_plan_Not Selected            36275 non-null  bool
 18  room_type_reserved_Room_Type 2            36275 non-null  bool
 19  room_type_reserved_Room_Type 3            36275 non-null  bool
 20  room_type_reserved_Room_Type 4            36275 non-null  bool
 21  room_type_reserved_Room_Type 5            36275 non-null  bool
 22  room_type_reserved_Room_Type 6            36275 non-null  bool
 23  room_type_reserved_Room_Type 7            36275 non-null  bool
 24  market_segment_type_Complementary         36275 non-null  bool
 25  market_segment_type_Corporate             36275 non-null  bool
 26  market_segment_type_Offline               36275 non-null  bool
 27  market_segment_type_Online                36275 non-null  bool
dtypes: bool(13), float64(6), int64(8), object(1)
memory usage: 4.6+ MB
dummied_cut = pd.cut(dummy_data['lead_time'], 5, labels=['lat_min','short','med','long','advanced'])
dummied_cut.head(10)
| | lead_time |
|---|---|
| 0 | med |
| 1 | lat_min |
| 2 | lat_min |
| 3 | med |
| 4 | lat_min |
| 5 | long |
| 6 | lat_min |
| 7 | lat_min |
| 8 | short |
| 9 | lat_min |
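Passing an integer to `pd.cut` creates equal-width bins over the observed range, so the labels above depend on the minimum and maximum lead time rather than on how many bookings fall in each bin. The sketch below uses `retbins=True` to show the edges; the `toy_lead` values are hypothetical, and `pd.qcut` would be the alternative if equal-sized groups were wanted.

```python
import pandas as pd

# Hypothetical lead times in days (not the full column)
toy_lead = pd.Series([0, 5, 48, 211, 224, 443])

# Five equal-width bins; retbins=True also returns the bin edges
binned, edges = pd.cut(
    toy_lead,
    5,
    labels=["lat_min", "short", "med", "long", "advanced"],
    retbins=True,
)
print(binned.tolist())
print(edges.round(1))
```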
data3 = pd.merge(dummy_data, dummied_cut, left_index=True, right_index=True)
data3.head().T
| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| no_of_adults | 2 | 2 | 1 | 2 | 2 |
| no_of_week_nights | 2 | 3 | 1 | 2 | 1 |
| required_car_parking_space | 0 | 0 | 0 | 0 | 0 |
| lead_time_x | 224 | 5 | 1 | 211 | 48 |
| arrival_year | 2017 | 2018 | 2018 | 2018 | 2018 |
| arrival_month | 10 | 11 | 2 | 5 | 4 |
| arrival_date | 2 | 6 | 28 | 20 | 11 |
| repeated_guest | 0 | 0 | 0 | 0 | 0 |
| avg_price_per_room | 65.0 | 106.68 | 60.0 | 100.0 | 94.5 |
| booking_status | False | False | True | True | True |
| no_of_children_log | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| no_of_weekend_nights_log | 0.693147 | 1.098612 | 1.098612 | 0.0 | 0.693147 |
| no_of_previous_cancellations_log | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| no_of_previous_bookings_not_canceled_log | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| no_of_special_requests_log | 0.0 | 0.693147 | 0.0 | 0.0 | 0.0 |
| type_of_meal_plan_Meal Plan 2 | False | False | False | False | False |
| type_of_meal_plan_Meal Plan 3 | False | False | False | False | False |
| type_of_meal_plan_Not Selected | False | True | False | False | True |
| room_type_reserved_Room_Type 2 | False | False | False | False | False |
| room_type_reserved_Room_Type 3 | False | False | False | False | False |
| room_type_reserved_Room_Type 4 | False | False | False | False | False |
| room_type_reserved_Room_Type 5 | False | False | False | False | False |
| room_type_reserved_Room_Type 6 | False | False | False | False | False |
| room_type_reserved_Room_Type 7 | False | False | False | False | False |
| market_segment_type_Complementary | False | False | False | False | False |
| market_segment_type_Corporate | False | False | False | False | False |
| market_segment_type_Offline | True | False | False | False | False |
| market_segment_type_Online | False | True | True | True | True |
| lead_time_y | med | lat_min | lat_min | med | lat_min |
# dropping the date variables and lead_time_x, since lead time has already been binned into five categories (lead_time_y)
data3_5 = data3.drop(['lead_time_x','arrival_date', 'arrival_year'], axis=1)
data4 = pd.get_dummies (
data3_5,
columns = [
'lead_time_y',
],
drop_first=True,
)
data4.head().T
| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| no_of_adults | 2 | 2 | 1 | 2 | 2 |
| no_of_week_nights | 2 | 3 | 1 | 2 | 1 |
| required_car_parking_space | 0 | 0 | 0 | 0 | 0 |
| arrival_month | 10 | 11 | 2 | 5 | 4 |
| repeated_guest | 0 | 0 | 0 | 0 | 0 |
| avg_price_per_room | 65.0 | 106.68 | 60.0 | 100.0 | 94.5 |
| booking_status | False | False | True | True | True |
| no_of_children_log | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| no_of_weekend_nights_log | 0.693147 | 1.098612 | 1.098612 | 0.0 | 0.693147 |
| no_of_previous_cancellations_log | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| no_of_previous_bookings_not_canceled_log | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| no_of_special_requests_log | 0.0 | 0.693147 | 0.0 | 0.0 | 0.0 |
| type_of_meal_plan_Meal Plan 2 | False | False | False | False | False |
| type_of_meal_plan_Meal Plan 3 | False | False | False | False | False |
| type_of_meal_plan_Not Selected | False | True | False | False | True |
| room_type_reserved_Room_Type 2 | False | False | False | False | False |
| room_type_reserved_Room_Type 3 | False | False | False | False | False |
| room_type_reserved_Room_Type 4 | False | False | False | False | False |
| room_type_reserved_Room_Type 5 | False | False | False | False | False |
| room_type_reserved_Room_Type 6 | False | False | False | False | False |
| room_type_reserved_Room_Type 7 | False | False | False | False | False |
| market_segment_type_Complementary | False | False | False | False | False |
| market_segment_type_Corporate | False | False | False | False | False |
| market_segment_type_Offline | True | False | False | False | False |
| market_segment_type_Online | False | True | True | True | True |
| lead_time_y_short | False | False | False | False | False |
| lead_time_y_med | True | False | False | True | False |
| lead_time_y_long | False | False | False | False | False |
| lead_time_y_advanced | False | False | False | False | False |
data4 = data4.astype(float)
data4.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36275 entries, 0 to 36274
Data columns (total 29 columns):
 #   Column                                    Non-Null Count  Dtype
---  ------                                    --------------  -----
 0   no_of_adults                              36275 non-null  float64
 1   no_of_week_nights                         36275 non-null  float64
 2   required_car_parking_space                36275 non-null  float64
 3   arrival_month                             36275 non-null  float64
 4   repeated_guest                            36275 non-null  float64
 5   avg_price_per_room                        36275 non-null  float64
 6   booking_status                            36275 non-null  float64
 7   no_of_children_log                        36275 non-null  float64
 8   no_of_weekend_nights_log                  36275 non-null  float64
 9   no_of_previous_cancellations_log          36275 non-null  float64
 10  no_of_previous_bookings_not_canceled_log  36275 non-null  float64
 11  no_of_special_requests_log                36275 non-null  float64
 12  type_of_meal_plan_Meal Plan 2             36275 non-null  float64
 13  type_of_meal_plan_Meal Plan 3             36275 non-null  float64
 14  type_of_meal_plan_Not Selected            36275 non-null  float64
 15  room_type_reserved_Room_Type 2            36275 non-null  float64
 16  room_type_reserved_Room_Type 3            36275 non-null  float64
 17  room_type_reserved_Room_Type 4            36275 non-null  float64
 18  room_type_reserved_Room_Type 5            36275 non-null  float64
 19  room_type_reserved_Room_Type 6            36275 non-null  float64
 20  room_type_reserved_Room_Type 7            36275 non-null  float64
 21  market_segment_type_Complementary         36275 non-null  float64
 22  market_segment_type_Corporate             36275 non-null  float64
 23  market_segment_type_Offline               36275 non-null  float64
 24  market_segment_type_Online                36275 non-null  float64
 25  lead_time_y_short                         36275 non-null  float64
 26  lead_time_y_med                           36275 non-null  float64
 27  lead_time_y_long                          36275 non-null  float64
 28  lead_time_y_advanced                      36275 non-null  float64
dtypes: float64(29)
memory usage: 8.0 MB
# Assuming data has the original 'booking_status' column, restore it as 0/1 floats
if 'booking_status' in data.columns:
    data4['booking_status'] = data['booking_status'].astype(float)
else:
    print("'booking_status' not found in the original data")
print(data4.columns)
Index(['no_of_adults', 'no_of_week_nights', 'required_car_parking_space',
'arrival_month', 'repeated_guest', 'avg_price_per_room',
'no_of_children_log', 'no_of_weekend_nights_log',
'no_of_previous_cancellations_log',
'no_of_previous_bookings_not_canceled_log',
'no_of_special_requests_log', 'type_of_meal_plan_Meal Plan 2',
'type_of_meal_plan_Meal Plan 3', 'type_of_meal_plan_Not Selected',
'room_type_reserved_Room_Type 2', 'room_type_reserved_Room_Type 3',
'room_type_reserved_Room_Type 4', 'room_type_reserved_Room_Type 5',
'room_type_reserved_Room_Type 6', 'room_type_reserved_Room_Type 7',
'market_segment_type_Complementary', 'market_segment_type_Corporate',
'market_segment_type_Offline', 'market_segment_type_Online',
'lead_time_y_short', 'lead_time_y_med', 'lead_time_y_long',
'lead_time_y_advanced', 'booking_status'],
dtype='object')
# Using the SCIEM method I will split the train test data first.
X = data4.drop("booking_status", axis=1)
y = data4["booking_status"]  # select instead of pop so data4 keeps the target column
# Import add_constant from statsmodels
from statsmodels.api import add_constant
# Adding a constant (intercept) column to X
X = add_constant(X)
# Now, X has a constant column added
print(X.head())
   const  no_of_adults  no_of_week_nights  required_car_parking_space  \
0    1.0           2.0                2.0                         0.0
1    1.0           2.0                3.0                         0.0
2    1.0           1.0                1.0                         0.0
3    1.0           2.0                2.0                         0.0
4    1.0           2.0                1.0                         0.0

   arrival_month  repeated_guest  avg_price_per_room  no_of_children_log  \
0           10.0             0.0               65.00                 0.0
1           11.0             0.0              106.68                 0.0
2            2.0             0.0               60.00                 0.0
3            5.0             0.0              100.00                 0.0
4            4.0             0.0               94.50                 0.0

   no_of_weekend_nights_log  no_of_previous_cancellations_log  ...  \
0                  0.693147                               0.0  ...
1                  1.098612                               0.0  ...
2                  1.098612                               0.0  ...
3                  0.000000                               0.0  ...
4                  0.693147                               0.0  ...

   room_type_reserved_Room_Type 6  room_type_reserved_Room_Type 7  \
0                             0.0                             0.0
1                             0.0                             0.0
2                             0.0                             0.0
3                             0.0                             0.0
4                             0.0                             0.0

   market_segment_type_Complementary  market_segment_type_Corporate  \
0                                0.0                            0.0
1                                0.0                            0.0
2                                0.0                            0.0
3                                0.0                            0.0
4                                0.0                            0.0

   market_segment_type_Offline  market_segment_type_Online  lead_time_y_short  \
0                          1.0                         0.0                0.0
1                          0.0                         1.0                0.0
2                          0.0                         1.0                0.0
3                          0.0                         1.0                0.0
4                          0.0                         1.0                0.0

   lead_time_y_med  lead_time_y_long  lead_time_y_advanced
0              1.0               0.0                   0.0
1              0.0               0.0                   0.0
2              0.0               0.0                   0.0
3              1.0               0.0                   0.0
4              0.0               0.0                   0.0

[5 rows x 29 columns]
# Train/Test Split 70/30
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.30, random_state=1)
print("Number of rows in train data =", X_train.shape[0])
print("Number of rows in test data =", X_test.shape[0])
Number of rows in train data = 25392
Number of rows in test data = 10883
print("Percentage of classes in training set:")
print(y_train.value_counts(normalize=True))
print("Percentage of classes in test set:")
print(y_test.value_counts(normalize=True))
Percentage of classes in training set:
booking_status
0.0    0.670644
1.0    0.329356
Name: proportion, dtype: float64
Percentage of classes in test set:
booking_status
0.0    0.676376
1.0    0.323624
Name: proportion, dtype: float64
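The class shares above differ slightly between the two sets (about 32.9% canceled in train versus 32.4% in test). If identical proportions are wanted, `train_test_split` accepts a `stratify` argument. The sketch below shows it on made-up arrays; `X_toy` and `y_toy` are hypothetical, and stratification is an optional variation rather than what the notebook does.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, 30% positive class
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([1] * 30 + [0] * 70)

# stratify=y_toy keeps the 30/70 class ratio in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.30, random_state=1, stratify=y_toy
)
print(y_tr.mean(), y_te.mean())
```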
X_train.info()
<class 'pandas.core.frame.DataFrame'>
Index: 25392 entries, 13662 to 33003
Data columns (total 28 columns):
 #   Column                                    Non-Null Count  Dtype
---  ------                                    --------------  -----
 0   no_of_adults                              25392 non-null  float64
 1   no_of_week_nights                         25392 non-null  float64
 2   required_car_parking_space                25392 non-null  float64
 3   arrival_month                             25392 non-null  float64
 4   repeated_guest                            25392 non-null  float64
 5   avg_price_per_room                        25392 non-null  float64
 6   no_of_children_log                        25392 non-null  float64
 7   no_of_weekend_nights_log                  25392 non-null  float64
 8   no_of_previous_cancellations_log          25392 non-null  float64
 9   no_of_previous_bookings_not_canceled_log  25392 non-null  float64
 10  no_of_special_requests_log                25392 non-null  float64
 11  type_of_meal_plan_Meal Plan 2             25392 non-null  float64
 12  type_of_meal_plan_Meal Plan 3             25392 non-null  float64
 13  type_of_meal_plan_Not Selected            25392 non-null  float64
 14  room_type_reserved_Room_Type 2            25392 non-null  float64
 15  room_type_reserved_Room_Type 3            25392 non-null  float64
 16  room_type_reserved_Room_Type 4            25392 non-null  float64
 17  room_type_reserved_Room_Type 5            25392 non-null  float64
 18  room_type_reserved_Room_Type 6            25392 non-null  float64
 19  room_type_reserved_Room_Type 7            25392 non-null  float64
 20  market_segment_type_Complementary         25392 non-null  float64
 21  market_segment_type_Corporate             25392 non-null  float64
 22  market_segment_type_Offline               25392 non-null  float64
 23  market_segment_type_Online                25392 non-null  float64
 24  lead_time_y_short                         25392 non-null  float64
 25  lead_time_y_med                           25392 non-null  float64
 26  lead_time_y_long                          25392 non-null  float64
 27  lead_time_y_advanced                      25392 non-null  float64
dtypes: float64(28)
memory usage: 5.6 MB
# It is a good idea to explore the data once again after manipulating it.
plt.figure(figsize=(20,10))
sns.heatmap(data4.corr(), annot=True, vmin=-1, vmax=1, fmt='.2f')
plt.show()
In order to make statistical inferences from a logistic regression model, it is important to ensure that there is no multicollinearity present in the data.
import statsmodels.api as sm
from statsmodels.api import add_constant
from statsmodels.stats.outliers_influence import variance_inflation_factor
import pandas as pd
# let's check the VIF of the predictors
vif_series = pd.Series(
[variance_inflation_factor(X_train.values, i) for i in range(X_train.shape[1])],
index=X_train.columns,
dtype=float,
)
print("VIF values: \n\n{}\n".format(vif_series))
VIF values:

no_of_adults                                18.197098
no_of_week_nights                            3.680150
required_car_parking_space                   1.075494
arrival_month                                7.097404
repeated_guest                               3.417968
avg_price_per_room                          17.989022
no_of_children_log                           2.006153
no_of_weekend_nights_log                     2.133807
no_of_previous_cancellations_log             1.608889
no_of_previous_bookings_not_canceled_log     3.570483
no_of_special_requests_log                   2.202569
type_of_meal_plan_Meal Plan 2                1.334174
type_of_meal_plan_Meal Plan 3                1.025507
type_of_meal_plan_Not Selected               1.432984
room_type_reserved_Room_Type 2               1.111269
room_type_reserved_Room_Type 3               1.003573
room_type_reserved_Room_Type 4               1.638586
room_type_reserved_Room_Type 5               1.034873
room_type_reserved_Room_Type 6               1.903476
room_type_reserved_Room_Type 7               1.114383
market_segment_type_Complementary            1.296787
market_segment_type_Corporate                2.418865
market_segment_type_Offline                  9.385691
market_segment_type_Online                  23.630149
lead_time_y_short                            1.424498
lead_time_y_med                              1.218744
lead_time_y_long                             1.207261
lead_time_y_advanced                         1.054674
dtype: float64
# Dropping the numbers of weekend and week nights because I have combined them into one,
# and the market segment dummies because the largest VIF values (Online, Offline) come from that group
X_train1 = X_train.drop(
    [
        'no_of_weekend_nights_log',
        'no_of_week_nights',
        'market_segment_type_Online',
        'market_segment_type_Offline',
        'market_segment_type_Corporate',
        'market_segment_type_Complementary',
    ],
    axis=1,
)
Building a Logistic Regression model
# Import necessary libraries
import statsmodels.api as sm
from statsmodels.api import add_constant
logit = sm.Logit(y_train, X_train1.astype(float))
lg = logit.fit()
Optimization terminated successfully.
Current function value: 0.495874
Iterations 10
# print the logistic regression summary
print(lg.summary())
Logit Regression Results
==============================================================================
Dep. Variable: booking_status No. Observations: 25392
Model: Logit Df Residuals: 25370
Method: MLE Df Model: 21
Date: Sun, 08 Sep 2024 Pseudo R-squ.: 0.2175
Time: 16:03:56 Log-Likelihood: -12591.
converged: True LL-Null: -16091.
Covariance Type: nonrobust LLR p-value: 0.000
============================================================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------------------------------------
no_of_adults -0.4712 0.028 -17.051 0.000 -0.525 -0.417
required_car_parking_space -1.3018 0.130 -9.990 0.000 -1.557 -1.046
arrival_month -0.1376 0.005 -27.464 0.000 -0.147 -0.128
repeated_guest -3.2515 0.644 -5.050 0.000 -4.513 -1.990
avg_price_per_room 0.0086 0.001 16.833 0.000 0.008 0.010
no_of_children_log 0.5743 0.091 6.341 0.000 0.397 0.752
no_of_previous_cancellations_log 1.2858 0.459 2.799 0.005 0.385 2.186
no_of_previous_bookings_not_canceled_log -1.0502 0.527 -1.993 0.046 -2.083 -0.017
no_of_special_requests_log -1.6744 0.041 -40.640 0.000 -1.755 -1.594
type_of_meal_plan_Meal Plan 2 -0.2106 0.055 -3.841 0.000 -0.318 -0.103
type_of_meal_plan_Meal Plan 3 0.5421 1.286 0.422 0.673 -1.978 3.062
type_of_meal_plan_Not Selected 0.5950 0.046 12.887 0.000 0.504 0.685
room_type_reserved_Room_Type 2 -0.3656 0.119 -3.074 0.002 -0.599 -0.132
room_type_reserved_Room_Type 3 -0.3770 0.987 -0.382 0.703 -2.312 1.558
room_type_reserved_Room_Type 4 0.5761 0.046 12.576 0.000 0.486 0.666
room_type_reserved_Room_Type 5 -0.6209 0.192 -3.228 0.001 -0.998 -0.244
room_type_reserved_Room_Type 6 0.0997 0.127 0.783 0.434 -0.150 0.349
room_type_reserved_Room_Type 7 -0.0272 0.264 -0.103 0.918 -0.545 0.490
lead_time_y_short 1.1559 0.037 31.123 0.000 1.083 1.229
lead_time_y_med 2.7193 0.057 47.914 0.000 2.608 2.831
lead_time_y_long 2.8778 0.076 37.914 0.000 2.729 3.027
lead_time_y_advanced 4.3622 0.248 17.618 0.000 3.877 4.847
============================================================================================================
# let's check the VIF of the predictors
vif_series = pd.Series(
[variance_inflation_factor(X_train1.values, i) for i in range(X_train1.shape[1])],
index=X_train1.columns,
dtype=float,
)
print("VIF values: \n\n{}\n".format(vif_series))
VIF values:

no_of_adults    11.519816
required_car_parking_space    1.069077
arrival_month    5.913945
repeated_guest    3.260798
avg_price_per_room    11.718683
no_of_children_log    1.991388
no_of_previous_cancellations_log    1.587147
no_of_previous_bookings_not_canceled_log    3.488538
no_of_special_requests_log    1.961361
type_of_meal_plan_Meal Plan 2    1.243078
type_of_meal_plan_Meal Plan 3    1.017756
type_of_meal_plan_Not Selected    1.276209
room_type_reserved_Room_Type 2    1.089233
room_type_reserved_Room_Type 3    1.000940
room_type_reserved_Room_Type 4    1.494463
room_type_reserved_Room_Type 5    1.020678
room_type_reserved_Room_Type 6    1.823130
room_type_reserved_Room_Type 7    1.065028
lead_time_y_short    1.372970
lead_time_y_med    1.192447
lead_time_y_long    1.179291
lead_time_y_advanced    1.051144
dtype: float64
# training-set predictions at a 0.5 probability threshold
pred_train = (lg.predict(X_train1) > 0.5).astype(int)
X_train2 = X_train1.drop(['room_type_reserved_Room_Type 3'], axis=1)
X_train2.info()
<class 'pandas.core.frame.DataFrame'>
Index: 25392 entries, 13662 to 33003
Data columns (total 21 columns):
 #   Column    Non-Null Count    Dtype
---  ------    --------------    -----
 0   no_of_adults    25392 non-null    float64
 1   required_car_parking_space    25392 non-null    float64
 2   arrival_month    25392 non-null    float64
 3   repeated_guest    25392 non-null    float64
 4   avg_price_per_room    25392 non-null    float64
 5   no_of_children_log    25392 non-null    float64
 6   no_of_previous_cancellations_log    25392 non-null    float64
 7   no_of_previous_bookings_not_canceled_log    25392 non-null    float64
 8   no_of_special_requests_log    25392 non-null    float64
 9   type_of_meal_plan_Meal Plan 2    25392 non-null    float64
 10  type_of_meal_plan_Meal Plan 3    25392 non-null    float64
 11  type_of_meal_plan_Not Selected    25392 non-null    float64
 12  room_type_reserved_Room_Type 2    25392 non-null    float64
 13  room_type_reserved_Room_Type 4    25392 non-null    float64
 14  room_type_reserved_Room_Type 5    25392 non-null    float64
 15  room_type_reserved_Room_Type 6    25392 non-null    float64
 16  room_type_reserved_Room_Type 7    25392 non-null    float64
 17  lead_time_y_short    25392 non-null    float64
 18  lead_time_y_med    25392 non-null    float64
 19  lead_time_y_long    25392 non-null    float64
 20  lead_time_y_advanced    25392 non-null    float64
dtypes: float64(21)
memory usage: 4.3 MB
logit = sm.Logit(y_train, X_train2.astype(float))
lg2 = logit.fit()
Optimization terminated successfully.
Current function value: 0.495877
Iterations 10
print(lg2.summary())
Logit Regression Results
==============================================================================
Dep. Variable: booking_status No. Observations: 25392
Model: Logit Df Residuals: 25371
Method: MLE Df Model: 20
Date: Sun, 08 Sep 2024 Pseudo R-squ.: 0.2175
Time: 16:07:42 Log-Likelihood: -12591.
converged: True LL-Null: -16091.
Covariance Type: nonrobust LLR p-value: 0.000
============================================================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------------------------------------
no_of_adults -0.4712 0.028 -17.051 0.000 -0.525 -0.417
required_car_parking_space -1.3018 0.130 -9.990 0.000 -1.557 -1.046
arrival_month -0.1376 0.005 -27.477 0.000 -0.147 -0.128
repeated_guest -3.2514 0.644 -5.050 0.000 -4.513 -1.989
avg_price_per_room 0.0086 0.001 16.836 0.000 0.008 0.010
no_of_children_log 0.5743 0.091 6.341 0.000 0.397 0.752
no_of_previous_cancellations_log 1.2857 0.459 2.799 0.005 0.385 2.186
no_of_previous_bookings_not_canceled_log -1.0503 0.527 -1.993 0.046 -2.083 -0.017
no_of_special_requests_log -1.6742 0.041 -40.638 0.000 -1.755 -1.593
type_of_meal_plan_Meal Plan 2 -0.2105 0.055 -3.839 0.000 -0.318 -0.103
type_of_meal_plan_Meal Plan 3 0.5422 1.286 0.422 0.673 -1.978 3.062
type_of_meal_plan_Not Selected 0.5949 0.046 12.887 0.000 0.504 0.685
room_type_reserved_Room_Type 2 -0.3655 0.119 -3.073 0.002 -0.599 -0.132
room_type_reserved_Room_Type 4 0.5762 0.046 12.577 0.000 0.486 0.666
room_type_reserved_Room_Type 5 -0.6208 0.192 -3.228 0.001 -0.998 -0.244
room_type_reserved_Room_Type 6 0.0996 0.127 0.782 0.434 -0.150 0.349
room_type_reserved_Room_Type 7 -0.0273 0.264 -0.103 0.918 -0.545 0.490
lead_time_y_short 1.1558 0.037 31.121 0.000 1.083 1.229
lead_time_y_med 2.7193 0.057 47.913 0.000 2.608 2.831
lead_time_y_long 2.8780 0.076 37.917 0.000 2.729 3.027
lead_time_y_advanced 4.3622 0.248 17.618 0.000 3.877 4.848
============================================================================================================
X_train3 = X_train2.drop(['no_of_previous_bookings_not_canceled_log'], axis=1)
logit = sm.Logit(y_train, X_train3.astype(float))
lg3 = logit.fit()
Optimization terminated successfully.
Current function value: 0.495990
Iterations 9
print(lg3.summary())
Logit Regression Results
==============================================================================
Dep. Variable: booking_status No. Observations: 25392
Model: Logit Df Residuals: 25372
Method: MLE Df Model: 19
Date: Sun, 08 Sep 2024 Pseudo R-squ.: 0.2173
Time: 16:08:29 Log-Likelihood: -12594.
converged: True LL-Null: -16091.
Covariance Type: nonrobust LLR p-value: 0.000
====================================================================================================
coef std err z P>|z| [0.025 0.975]
----------------------------------------------------------------------------------------------------
no_of_adults -0.4715 0.028 -17.060 0.000 -0.526 -0.417
required_car_parking_space -1.3007 0.130 -9.983 0.000 -1.556 -1.045
arrival_month -0.1376 0.005 -27.482 0.000 -0.147 -0.128
repeated_guest -3.9896 0.551 -7.239 0.000 -5.070 -2.909
avg_price_per_room 0.0086 0.001 16.848 0.000 0.008 0.010
no_of_children_log 0.5749 0.091 6.348 0.000 0.397 0.752
no_of_previous_cancellations_log 1.0029 0.376 2.666 0.008 0.265 1.740
no_of_special_requests_log -1.6758 0.041 -40.677 0.000 -1.757 -1.595
type_of_meal_plan_Meal Plan 2 -0.2122 0.055 -3.869 0.000 -0.320 -0.105
type_of_meal_plan_Meal Plan 3 0.5422 1.285 0.422 0.673 -1.977 3.062
type_of_meal_plan_Not Selected 0.5952 0.046 12.892 0.000 0.505 0.686
room_type_reserved_Room_Type 2 -0.3661 0.119 -3.077 0.002 -0.599 -0.133
room_type_reserved_Room_Type 4 0.5764 0.046 12.581 0.000 0.487 0.666
room_type_reserved_Room_Type 5 -0.6225 0.192 -3.238 0.001 -0.999 -0.246
room_type_reserved_Room_Type 6 0.0989 0.127 0.777 0.437 -0.151 0.348
room_type_reserved_Room_Type 7 -0.0296 0.264 -0.112 0.911 -0.547 0.488
lead_time_y_short 1.1561 0.037 31.129 0.000 1.083 1.229
lead_time_y_med 2.7213 0.057 47.943 0.000 2.610 2.833
lead_time_y_long 2.8798 0.076 37.924 0.000 2.731 3.029
lead_time_y_advanced 4.4037 0.250 17.620 0.000 3.914 4.894
====================================================================================================
# let's check the VIF of the predictors again to see if any multicollinearity persists
vif_series = pd.Series(
[variance_inflation_factor(X_train3.values, i) for i in range(X_train3.shape[1])],
index=X_train3.columns,
dtype=float,
)
print("VIF values: \n\n{}\n".format(vif_series))
VIF values:

no_of_adults    11.519270
required_car_parking_space    1.068324
arrival_month    5.911260
repeated_guest    1.508209
avg_price_per_room    11.712940
no_of_children_log    1.991288
no_of_previous_cancellations_log    1.436736
no_of_special_requests_log    1.952941
type_of_meal_plan_Meal Plan 2    1.242879
type_of_meal_plan_Meal Plan 3    1.017726
type_of_meal_plan_Not Selected    1.276122
room_type_reserved_Room_Type 2    1.089226
room_type_reserved_Room_Type 4    1.494420
room_type_reserved_Room_Type 5    1.019830
room_type_reserved_Room_Type 6    1.822871
room_type_reserved_Room_Type 7    1.064638
lead_time_y_short    1.372770
lead_time_y_med    1.192427
lead_time_y_long    1.179198
lead_time_y_advanced    1.050944
dtype: float64
X_train4 = X_train3.drop(['room_type_reserved_Room_Type 2'], axis=1)
logit = sm.Logit(y_train, X_train4.astype(float))
lg4 = logit.fit()
Optimization terminated successfully.
Current function value: 0.496180
Iterations 9
print(lg4.summary())
Logit Regression Results
==============================================================================
Dep. Variable: booking_status No. Observations: 25392
Model: Logit Df Residuals: 25373
Method: MLE Df Model: 18
Date: Sun, 08 Sep 2024 Pseudo R-squ.: 0.2170
Time: 16:09:27 Log-Likelihood: -12599.
converged: True LL-Null: -16091.
Covariance Type: nonrobust LLR p-value: 0.000
====================================================================================================
coef std err z P>|z| [0.025 0.975]
----------------------------------------------------------------------------------------------------
no_of_adults -0.4727 0.028 -17.118 0.000 -0.527 -0.419
required_car_parking_space -1.3070 0.130 -10.021 0.000 -1.563 -1.051
arrival_month -0.1376 0.005 -27.504 0.000 -0.147 -0.128
repeated_guest -3.9887 0.551 -7.239 0.000 -5.069 -2.909
avg_price_per_room 0.0086 0.001 16.834 0.000 0.008 0.010
no_of_children_log 0.5073 0.088 5.771 0.000 0.335 0.680
no_of_previous_cancellations_log 1.0050 0.376 2.674 0.008 0.268 1.742
no_of_special_requests_log -1.6788 0.041 -40.761 0.000 -1.760 -1.598
type_of_meal_plan_Meal Plan 2 -0.2055 0.055 -3.750 0.000 -0.313 -0.098
type_of_meal_plan_Meal Plan 3 0.5230 1.282 0.408 0.683 -1.991 3.037
type_of_meal_plan_Not Selected 0.6001 0.046 13.005 0.000 0.510 0.691
room_type_reserved_Room_Type 4 0.5847 0.046 12.782 0.000 0.495 0.674
room_type_reserved_Room_Type 5 -0.6109 0.192 -3.180 0.001 -0.987 -0.234
room_type_reserved_Room_Type 6 0.1712 0.125 1.369 0.171 -0.074 0.416
room_type_reserved_Room_Type 7 0.0189 0.263 0.072 0.943 -0.496 0.534
lead_time_y_short 1.1533 0.037 31.065 0.000 1.080 1.226
lead_time_y_med 2.7090 0.057 47.907 0.000 2.598 2.820
lead_time_y_long 2.8751 0.076 37.901 0.000 2.726 3.024
lead_time_y_advanced 4.3985 0.250 17.604 0.000 3.909 4.888
====================================================================================================
X_train5 = X_train4.drop(['room_type_reserved_Room_Type 4'], axis=1)
logit = sm.Logit(y_train, X_train5.astype(float))
lg5 = logit.fit()
Optimization terminated successfully.
Current function value: 0.499373
Iterations 9
print(lg5.summary())
Logit Regression Results
==============================================================================
Dep. Variable: booking_status No. Observations: 25392
Model: Logit Df Residuals: 25374
Method: MLE Df Model: 17
Date: Sun, 08 Sep 2024 Pseudo R-squ.: 0.2120
Time: 16:10:07 Log-Likelihood: -12680.
converged: True LL-Null: -16091.
Covariance Type: nonrobust LLR p-value: 0.000
====================================================================================================
coef std err z P>|z| [0.025 0.975]
----------------------------------------------------------------------------------------------------
no_of_adults -0.4324 0.027 -15.812 0.000 -0.486 -0.379
required_car_parking_space -1.2989 0.130 -10.009 0.000 -1.553 -1.045
arrival_month -0.1445 0.005 -29.008 0.000 -0.154 -0.135
repeated_guest -4.0168 0.550 -7.300 0.000 -5.095 -2.938
avg_price_per_room 0.0099 0.001 19.711 0.000 0.009 0.011
no_of_children_log 0.4024 0.087 4.632 0.000 0.232 0.573
no_of_previous_cancellations_log 0.9906 0.377 2.629 0.009 0.252 1.729
no_of_special_requests_log -1.6350 0.041 -40.101 0.000 -1.715 -1.555
type_of_meal_plan_Meal Plan 2 -0.3010 0.054 -5.548 0.000 -0.407 -0.195
type_of_meal_plan_Meal Plan 3 0.5454 1.349 0.404 0.686 -2.098 3.189
type_of_meal_plan_Not Selected 0.4520 0.044 10.200 0.000 0.365 0.539
room_type_reserved_Room_Type 5 -0.7607 0.191 -3.988 0.000 -1.135 -0.387
room_type_reserved_Room_Type 6 -0.0001 0.123 -0.001 0.999 -0.242 0.242
room_type_reserved_Room_Type 7 -0.2404 0.261 -0.922 0.356 -0.751 0.270
lead_time_y_short 1.1233 0.037 30.394 0.000 1.051 1.196
lead_time_y_med 2.6371 0.056 47.049 0.000 2.527 2.747
lead_time_y_long 2.7832 0.075 36.876 0.000 2.635 2.931
lead_time_y_advanced 4.3071 0.250 17.242 0.000 3.817 4.797
====================================================================================================
# converting coefficients to odds ratios (exp of each coefficient)
odds = np.exp(lg5.params)
# adding the odds to a dataframe
pd.DataFrame(odds, X_train5.columns, columns=["odds"]).T
| | no_of_adults | required_car_parking_space | arrival_month | repeated_guest | avg_price_per_room | no_of_children_log | no_of_previous_cancellations_log | no_of_special_requests_log | type_of_meal_plan_Meal Plan 2 | type_of_meal_plan_Meal Plan 3 | type_of_meal_plan_Not Selected | room_type_reserved_Room_Type 5 | room_type_reserved_Room_Type 6 | room_type_reserved_Room_Type 7 | lead_time_y_short | lead_time_y_med | lead_time_y_long | lead_time_y_advanced |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| odds | 0.648972 | 0.272826 | 0.865486 | 0.018011 | 1.009954 | 1.495483 | 2.692794 | 0.194945 | 0.740111 | 1.725311 | 1.57153 | 0.46736 | 0.999857 | 0.78635 | 3.074889 | 13.973046 | 16.170171 | 74.22528 |
# finding the percentage change
perc_change_odds = (np.exp(lg5.params) - 1) * 100
# adding the change_odds% to a dataframe
pd.DataFrame(perc_change_odds, X_train5.columns, columns=["change_odds%"]).T
| | no_of_adults | required_car_parking_space | arrival_month | repeated_guest | avg_price_per_room | no_of_children_log | no_of_previous_cancellations_log | no_of_special_requests_log | type_of_meal_plan_Meal Plan 2 | type_of_meal_plan_Meal Plan 3 | type_of_meal_plan_Not Selected | room_type_reserved_Room_Type 5 | room_type_reserved_Room_Type 6 | room_type_reserved_Room_Type 7 | lead_time_y_short | lead_time_y_med | lead_time_y_long | lead_time_y_advanced |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| change_odds% | -35.102845 | -72.717418 | -13.451435 | -98.198868 | 0.995447 | 49.548294 | 169.279353 | -80.505527 | -25.988902 | 72.531135 | 57.152989 | -53.263997 | -0.014306 | -21.364981 | 207.488857 | 1297.304607 | 1517.017069 | 7322.527969 |
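Both tables above come from the same transformation: the odds ratio is exp(coef), and the percentage change in odds is (exp(coef) - 1) * 100. The sketch below reproduces three entries from rounded coefficients copied out of the lg5 summary. Because the summary rounds to four decimals, the results match the tables only approximately, and the helper names are illustrative.

```python
import math

# Rounded coefficients copied from the lg5 summary
coefs = {
    "repeated_guest": -4.0168,
    "avg_price_per_room": 0.0099,
    "lead_time_y_short": 1.1233,
}


def odds_ratio(coef):
    """Multiplicative change in the odds for a one-unit increase in the predictor."""
    return math.exp(coef)


def pct_change_in_odds(coef):
    """The same effect expressed as a percentage change in the odds."""
    return (math.exp(coef) - 1.0) * 100.0


for name, coef in coefs.items():
    print(f"{name}: odds ratio = {odds_ratio(coef):.4f}, "
          f"change = {pct_change_in_odds(coef):+.2f}%")
```

An odds ratio below 1 lowers the odds of the positive class and one above 1 raises them, holding the other predictors fixed. Assuming `booking_status = 1` encodes a cancellation, a repeated guest has roughly 98% lower odds of cancelling.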
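# refitting the final model (same predictors as lg5) on the training set
logit = sm.Logit(y_train, X_train5.astype(float))
lg5 = logit.fit()
# rounding the predicted probabilities to class labels (equivalent to a 0.5 threshold)
pred_train4 = lg5.predict(X_train5)
pred_train4 = np.round(pred_train4)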
Optimization terminated successfully.
Current function value: 0.499373
Iterations 9
# another confusion matrix
cm = confusion_matrix(y_train, pred_train4)
plt.figure(figsize=(7, 5))
sns.heatmap(cm, annot=True, fmt="g")
plt.xlabel("Predicted Values")
plt.ylabel("Actual Values")
plt.show()
print("Accuracy on training set : ", accuracy_score(y_train, pred_train4))
Accuracy on training set : 0.7559861373660995
logit_roc_auc_train = roc_auc_score(y_train, lg5.predict(X_train5))
fpr, tpr, thresholds = roc_curve(y_train, lg5.predict(X_train5))
plt.figure(figsize=(7, 5))
plt.plot(fpr, tpr, label="Logistic Regression (area = %0.2f)" % logit_roc_auc_train)
plt.plot([0, 1], [0, 1], "r--")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Receiver operating characteristic")
plt.legend(loc="lower right")
plt.show()
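The area under the ROC curve plotted above can also be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. The sketch below computes it that way in plain Python; the function name `auc_by_pairs` and the four sample scores are assumptions for illustration, and `roc_auc_score` remains the practical choice on real data.

```python
def auc_by_pairs(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical labels and predicted probabilities
labels = [0, 0, 1, 1]
probabilities = [0.1, 0.4, 0.35, 0.8]
print("AUC:", auc_by_pairs(labels, probabilities))
```

The pairwise loop is quadratic in the number of rows, so it is meant only to make the definition concrete.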
# dropping variables from test set as well which were dropped from training set
X_test1 = X_test.drop([ 'no_of_weekend_nights_log',
'no_of_week_nights',
'market_segment_type_Online',
'market_segment_type_Offline',
'market_segment_type_Corporate',
'market_segment_type_Complementary',
'room_type_reserved_Room_Type 3',
'room_type_reserved_Room_Type 4',
'no_of_previous_bookings_not_canceled_log',
'room_type_reserved_Room_Type 2'
], axis=1)
# test-set predictions at a 0.5 probability threshold
pred_test = (lg5.predict(X_test1) > 0.5).astype(int)
print("Accuracy on training set : ", accuracy_score(y_train, pred_train4))
print("Accuracy on test set : ", accuracy_score(y_test, pred_test))
Accuracy on training set :  0.7559861373660995
Accuracy on test set :  0.7647707433612055
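The accuracies above depend on the 0.5 cut-off used to turn predicted probabilities into class labels. The sketch below shows that step and the accuracy calculation on a handful of made-up probabilities; `classify`, `accuracy` and the sample values are illustrative assumptions, not output from the fitted model.

```python
def classify(probabilities, threshold=0.5):
    """Label a case 1 when its predicted probability exceeds the threshold."""
    return [1 if p > threshold else 0 for p in probabilities]


def accuracy(y_true, y_pred):
    """Share of predictions that equal the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Hypothetical predicted probabilities and true labels
probs = [0.10, 0.62, 0.48, 0.91, 0.30]
actual = [0, 1, 1, 1, 0]

preds = classify(probs)
print(preds, accuracy(actual, preds))

# A lower threshold flags more cases as positive
preds_low = classify(probs, threshold=0.4)
print(preds_low, accuracy(actual, preds_low))
```

Moving the threshold trades false negatives against false positives, so a cut-off other than 0.5 may suit the business cost of a missed cancellation better.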
Building a Decision Tree model
tree_data = dummy_data.astype(float)
tree_data = tree_data.drop(['arrival_date','arrival_year','no_of_week_nights',
'no_of_weekend_nights_log' ], axis=1)
X = tree_data.drop("booking_status" , axis=1)
y = tree_data.pop("booking_status")
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.30, random_state=1)
# building a decision tree with DecisionTreeClassifier
dTree = DecisionTreeClassifier(criterion = 'gini', random_state=1)
dTree.fit(X_train, y_train)
DecisionTreeClassifier(random_state=1)
#scoring the accuracy on train & test data
print("Accuracy on training set : ",dTree.score(X_train, y_train))
print("Accuracy on test set : ",dTree.score(X_test, y_test))
Accuracy on training set :  0.9884215500945179
Accuracy on test set :  0.8645594045759442
# checking the positive outcomes
y.sum(axis = 0)
11885.0
Insights
from sklearn import metrics

## Function to create confusion matrix
def make_confusion_matrix(model, y_actual, labels=(0, 1)):
    '''
    model : classifier to predict values of X
    y_actual : ground truth
    labels : class order used for the matrix rows and columns
    '''
    # predictions are made on the global X_test
    y_predict = model.predict(X_test)
    cm = metrics.confusion_matrix(y_actual, y_predict, labels=list(labels))
    df_cm = pd.DataFrame(cm, index=["Actual - No", "Actual - Yes"],
                         columns=["Predicted - No", "Predicted - Yes"])
    group_counts = ["{0:0.0f}".format(value) for value in cm.flatten()]
    group_percentages = ["{0:.2%}".format(value) for value in cm.flatten() / np.sum(cm)]
    # each cell shows the count and its share of all observations
    cell_labels = [f"{v1}\n{v2}" for v1, v2 in zip(group_counts, group_percentages)]
    cell_labels = np.asarray(cell_labels).reshape(2, 2)
    plt.figure(figsize=(10, 7))
    sns.heatmap(df_cm, annot=cell_labels, fmt='')
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

## Function to calculate recall score
def get_recall_score(model):
    '''
    model : classifier to predict values of X
    '''
    pred_train = model.predict(X_train)
    pred_test = model.predict(X_test)
    print("Recall on training set : ", metrics.recall_score(y_train, pred_train))
    print("Recall on test set : ", metrics.recall_score(y_test, pred_test))
# Import necessary libraries
from sklearn.metrics import confusion_matrix, accuracy_score
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
# Define a function to make the confusion matrix
def make_confusion_matrix(model, y_test, X_test):
    # Generate predictions using the model
    y_pred = model.predict(X_test)
    # Create the confusion matrix
    cm = confusion_matrix(y_test, y_pred)
    # Plot the confusion matrix
    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=['False', 'True'], yticklabels=['False', 'True'])
    plt.ylabel('Actual')
    plt.xlabel('Predicted')
    plt.title('Confusion Matrix')
    plt.show()
# Assuming you have a trained decision tree model `dTree`
# Now call the function to display the confusion matrix
make_confusion_matrix(dTree, y_test, X_test)
# Import necessary library
from sklearn.metrics import recall_score
# Define a function to calculate recall score on train and test data
def get_recall_score(model, X_train, y_train, X_test, y_test):
    # Get predictions on both training and test sets
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)
    # Calculate recall score on training set
    recall_train = recall_score(y_train, y_train_pred)
    # Calculate recall score on test set
    recall_test = recall_score(y_test, y_test_pred)
    # Print the recall scores
    print(f"Recall Score on Training Set: {recall_train:.4f}")
    print(f"Recall Score on Test Set: {recall_test:.4f}")
# Assuming you have a trained decision tree model `dTree` and train/test data
get_recall_score(dTree, X_train, y_train, X_test, y_test)
Recall Score on Training Set: 0.9711
Recall Score on Test Set: 0.7978
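Recall is the share of actual positives that the model catches, TP / (TP + FN). The sketch below derives it from raw confusion-matrix counts in plain Python; the helper names and the six sample labels are illustrative assumptions rather than the notebook's predictions.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels coded as 0 and 1."""
    tp = fp = tn = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            tp += 1
        elif actual == 0 and predicted == 1:
            fp += 1
        elif actual == 0 and predicted == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def recall(y_true, y_pred):
    """Recall = TP / (TP + FN); returns 0.0 when there are no actual positives."""
    tp, _, _, fn = confusion_counts(y_true, y_pred)
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical true and predicted labels
y_true_demo = [1, 1, 1, 0, 0, 1]
y_pred_demo = [1, 0, 1, 0, 1, 1]
print(confusion_counts(y_true_demo, y_pred_demo), recall(y_true_demo, y_pred_demo))
```

A large gap between training and test recall, like the 0.9711 versus 0.7978 reported above, is a common sign that an unpruned tree is overfitting.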
the_features = list(X.columns)
print(the_features)
['no_of_adults', 'required_car_parking_space', 'lead_time', 'arrival_month', 'repeated_guest', 'avg_price_per_room', 'no_of_children_log', 'no_of_previous_cancellations_log', 'no_of_previous_bookings_not_canceled_log', 'no_of_special_requests_log', 'type_of_meal_plan_Meal Plan 2', 'type_of_meal_plan_Meal Plan 3', 'type_of_meal_plan_Not Selected', 'room_type_reserved_Room_Type 2', 'room_type_reserved_Room_Type 3', 'room_type_reserved_Room_Type 4', 'room_type_reserved_Room_Type 5', 'room_type_reserved_Room_Type 6', 'room_type_reserved_Room_Type 7', 'market_segment_type_Complementary', 'market_segment_type_Corporate', 'market_segment_type_Offline', 'market_segment_type_Online']
# Import necessary libraries
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree
# checking out what variables are being prioritized by the model.
print (pd.DataFrame(dTree.feature_importances_, columns = ["Imp"], index = X_train.columns).sort_values(by = 'Imp', ascending = False))
| | Imp |
|---|---|
| lead_time | 0.428752 |
| avg_price_per_room | 0.229776 |
| market_segment_type_Online | 0.093703 |
| arrival_month | 0.089176 |
| no_of_special_requests_log | 0.069752 |
| no_of_adults | 0.033601 |
| type_of_meal_plan_Not Selected | 0.011043 |
| room_type_reserved_Room_Type 4 | 0.010264 |
| required_car_parking_space | 0.007843 |
| no_of_children_log | 0.007265 |
| type_of_meal_plan_Meal Plan 2 | 0.005708 |
| market_segment_type_Offline | 0.004476 |
| room_type_reserved_Room_Type 2 | 0.002571 |
| room_type_reserved_Room_Type 5 | 0.001512 |
| room_type_reserved_Room_Type 6 | 0.001388 |
| market_segment_type_Corporate | 0.001179 |
| no_of_previous_bookings_not_canceled_log | 0.000707 |
| room_type_reserved_Room_Type 7 | 0.000500 |
| repeated_guest | 0.000433 |
| no_of_previous_cancellations_log | 0.000339 |
| room_type_reserved_Room_Type 3 | 0.000013 |
| market_segment_type_Complementary | 0.000000 |
| type_of_meal_plan_Meal Plan 3 | 0.000000 |
importances = dTree.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12,12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='violet', align='center')
plt.yticks(range(len(indices)), [the_features[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
Prune the Model
# Pre prune the model with max depth hyperparameter
dTree1 = DecisionTreeClassifier(criterion = 'gini',max_depth=3,random_state=1)
dTree1.fit(X_train, y_train)
DecisionTreeClassifier(max_depth=3, random_state=1)
# Define the confusion matrix function
def make_confusion_matrix(model, X_test, y_test):
    # Generate predictions using the model
    y_pred = model.predict(X_test)
    # Create the confusion matrix
    cm = confusion_matrix(y_test, y_pred)
    # Plot the confusion matrix using seaborn
    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=['False', 'True'], yticklabels=['False', 'True'])
    plt.ylabel('Actual')
    plt.xlabel('Predicted')
    plt.title('Confusion Matrix')
    plt.show()
# Call the function with the correct arguments
make_confusion_matrix(dTree1, X_test, y_test) # Ensure X_test is passed along with y_test
# The accuracy on the pre pruned tree.
print("Accuracy on training set : ",dTree1.score(X_train, y_train))
print("Accuracy on test set : ",dTree1.score(X_test, y_test))
Accuracy on training set :  0.7844202898550725
Accuracy on test set :  0.7913259211614444
Insights
# Looking at the feature importances of this model
importances = dTree1.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(10,10))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='violet', align='center')
plt.yticks(range(len(indices)), [the_features[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
GridSearch to hyperparameter tune the model
# Import necessary libraries
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer, recall_score
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns
# Choose the type of classifier
estimator = DecisionTreeClassifier(random_state=1)
# Reduce the parameter grid to fewer combinations
parameters = {
'max_depth': np.arange(3, 8), # Narrowed the range to reduce combinations
'min_samples_leaf': [1, 5, 10], # Fewer values
'max_leaf_nodes': [5, 10], # Reduced options
'min_impurity_decrease': [0.001, 0.01] # Simplified grid
}
# Scoring function used to compare parameter combinations (recall score)
acc_scorer = make_scorer(recall_score)
# Run the grid search with reduced cv and parallel processing
grid_obj = GridSearchCV(estimator, parameters, scoring=acc_scorer, cv=3, n_jobs=-1) # Use 3-fold CV and parallelization
grid_obj = grid_obj.fit(X_train, y_train)
# Set the estimator to the best combination of parameters
estimator = grid_obj.best_estimator_
# Fit the best algorithm to the data
estimator.fit(X_train, y_train)
DecisionTreeClassifier(max_depth=3, max_leaf_nodes=5,
min_impurity_decrease=0.001, random_state=1)
Parameters of the selected tree:
- max_depth=3
- max_leaf_nodes=5
- min_impurity_decrease=0.001
- random_state=1
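To get a feel for how much work the grid search above does, the candidate combinations can be enumerated directly. The sketch below mirrors the parameter grid with plain lists (`np.arange(3, 8)` written out as 3 to 7) and counts the combinations and cross-validation fits; the variable names are illustrative and no model is trained.

```python
from itertools import product

# Same grid as in the search, written with plain Python lists
param_grid = {
    "max_depth": [3, 4, 5, 6, 7],
    "min_samples_leaf": [1, 5, 10],
    "max_leaf_nodes": [5, 10],
    "min_impurity_decrease": [0.001, 0.01],
}
cv_folds = 3

# Every combination of one value per parameter
combinations = [
    dict(zip(param_grid, values)) for values in product(*param_grid.values())
]
total_fits = len(combinations) * cv_folds

print("candidate settings:", len(combinations))
print("cross-validation fits:", total_fits)
print("first candidate:", combinations[0])
```

With `refit=True`, the scikit-learn default, `GridSearchCV` also refits the best setting on the full training data once the search finishes, so `best_estimator_` is already trained when it is returned.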
# Define the confusion matrix function (if not already defined)
def make_confusion_matrix(model, X_test, y_test):
    # Generate predictions using the model
    y_pred = model.predict(X_test)
    # Create the confusion matrix
    cm = confusion_matrix(y_test, y_pred)
    # Plot the confusion matrix using seaborn
    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=['False', 'True'], yticklabels=['False', 'True'])
    plt.ylabel('Actual')
    plt.xlabel('Predicted')
    plt.title('Confusion Matrix')
    plt.show()
# Assuming `estimator` is your trained model and `X_test`, `y_test` are defined
# Call the function with both X_test and y_test as arguments
make_confusion_matrix(estimator, X_test, y_test)
# The accuracy on the estimator tree.
print("Accuracy on training set : ",estimator.score(X_train, y_train))
print("Accuracy on test set : ",estimator.score(X_test, y_test))
Accuracy on training set :  0.7694943289224953
Accuracy on test set :  0.7719378847744188
importances = estimator.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12,12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='violet', align='center')
plt.yticks(range(len(indices)), [the_features[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
Cost Complexity Pruning
clf = DecisionTreeClassifier(random_state=1)
path = clf.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas, impurities = path.ccp_alphas, path.impurities
pd.DataFrame(path)
| | ccp_alphas | impurities |
|---|---|---|
| 0 | 0.000000 | 0.013864 |
| 1 | 0.000000 | 0.013864 |
| 2 | 0.000000 | 0.013864 |
| 3 | 0.000000 | 0.013864 |
| 4 | 0.000000 | 0.013864 |
| ... | ... | ... |
| 1580 | 0.006666 | 0.286897 |
| 1581 | 0.013045 | 0.299942 |
| 1582 | 0.017260 | 0.317202 |
| 1583 | 0.023990 | 0.365183 |
| 1584 | 0.076578 | 0.441761 |
1585 rows × 2 columns
fig, ax = plt.subplots(figsize=(10,5))
ax.plot(ccp_alphas[:-1], impurities[:-1], marker='o', drawstyle="steps-post")
ax.set_xlabel("effective alpha")
ax.set_ylabel("total impurity of leaves")
ax.set_title("Total Impurity vs effective alpha for training set")
plt.show()
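Cost-complexity pruning scores each candidate subtree T as R(T) + alpha * |leaves(T)|, where R(T) is the total impurity of its leaves, and keeps the subtree with the lowest score. The sketch below applies that rule to four hypothetical subtrees; the leaf counts and impurities are invented for illustration and are not taken from the pruning path above.

```python
def cost_complexity(impurity, n_leaves, alpha):
    """Penalised score: total leaf impurity plus alpha per leaf."""
    return impurity + alpha * n_leaves


def best_subtree(subtrees, alpha):
    """Pick the (n_leaves, impurity) pair with the lowest penalised score."""
    return min(subtrees, key=lambda t: cost_complexity(t[1], t[0], alpha))


# Hypothetical nested subtrees as (number of leaves, total leaf impurity)
subtrees = [(50, 0.05), (10, 0.20), (3, 0.30), (1, 0.44)]

for alpha in (0.0, 0.005, 0.05, 0.1):
    print(f"alpha={alpha}: best subtree {best_subtree(subtrees, alpha)}")
```

At alpha = 0 the largest tree wins because leaves cost nothing. As alpha grows, smaller trees are preferred, until only the root remains.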
# Import necessary libraries
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
import numpy as np
# Assuming you have computed ccp_alphas using cost complexity pruning
# Limiting the number of ccp_alpha values (e.g., choose 10 evenly spaced alphas)
ccp_alphas = np.linspace(min(ccp_alphas), max(ccp_alphas), 10)
# List to store classifiers for each alpha
clfs = []
# Loop through the reduced list of alphas and train decision trees
for ccp_alpha in ccp_alphas:
    clf = DecisionTreeClassifier(random_state=1, ccp_alpha=ccp_alpha)
    clf.fit(X_train, y_train)
    clfs.append(clf)
    # Logging the progress for better monitoring
    print(f"Trained decision tree with ccp_alpha: {ccp_alpha}")
# Print the number of nodes in the last tree
print("Number of nodes in the last tree is: {} with ccp_alpha: {}".format(
clfs[-1].tree_.node_count, ccp_alphas[-1]))
Trained decision tree with ccp_alpha: 0.0
Trained decision tree with ccp_alpha: 0.008508654974857064
Trained decision tree with ccp_alpha: 0.017017309949714128
Trained decision tree with ccp_alpha: 0.025525964924571192
Trained decision tree with ccp_alpha: 0.034034619899428256
Trained decision tree with ccp_alpha: 0.04254327487428532
Trained decision tree with ccp_alpha: 0.051051929849142384
Trained decision tree with ccp_alpha: 0.05956058482399945
Trained decision tree with ccp_alpha: 0.06806923979885651
Trained decision tree with ccp_alpha: 0.07657789477371357
Number of nodes in the last tree is: 1 with ccp_alpha: 0.07657789477371357
clfs = clfs[:-1]
ccp_alphas = ccp_alphas[:-1]
node_counts = [clf.tree_.node_count for clf in clfs]
depth = [clf.tree_.max_depth for clf in clfs]
fig, ax = plt.subplots(2, 1,figsize=(10,7))
ax[0].plot(ccp_alphas, node_counts, marker='o', drawstyle="steps-post")
ax[0].set_xlabel("alpha")
ax[0].set_ylabel("number of nodes")
ax[0].set_title("Number of nodes vs alpha")
ax[1].plot(ccp_alphas, depth, marker='o', drawstyle="steps-post")
ax[1].set_xlabel("alpha")
ax[1].set_ylabel("depth of tree")
ax[1].set_title("Depth vs alpha")
fig.tight_layout()
Accuracy vs alpha for the training and testing sets
train_scores = [clf.score(X_train, y_train) for clf in clfs]
test_scores = [clf.score(X_test, y_test) for clf in clfs]
fig, ax = plt.subplots(figsize=(10,5))
ax.set_xlabel("alpha")
ax.set_ylabel("accuracy")
ax.set_title("Accuracy vs alpha for training and testing sets")
ax.plot(ccp_alphas, train_scores, marker='o', label="train",
        drawstyle="steps-post")
ax.plot(ccp_alphas, test_scores, marker='o', label="test",
        drawstyle="steps-post")
ax.legend()
plt.show()
index_best_model = np.argmax(test_scores)
best_model = clfs[index_best_model]
print(best_model)
print('Training accuracy of best model: ',best_model.score(X_train, y_train))
print('Test accuracy of best model: ',best_model.score(X_test, y_test))
DecisionTreeClassifier(random_state=1)
Training accuracy of best model: 0.9884215500945179
Test accuracy of best model: 0.8645594045759442
from sklearn.metrics import recall_score
# Recall of each pruned tree on the training set
recall_train = []
for clf in clfs:
    pred_train3 = clf.predict(X_train)
    values_train = recall_score(y_train, pred_train3)
    recall_train.append(values_train)
# Recall of each pruned tree on the test set
recall_test = []
for clf in clfs:
    pred_test3 = clf.predict(X_test)
    values_test = recall_score(y_test, pred_test3)
    recall_test.append(values_test)
fig, ax = plt.subplots(figsize=(15,5))
ax.set_xlabel("alpha")
ax.set_ylabel("Recall")
ax.set_title("Recall vs alpha for training and testing sets")
ax.plot(ccp_alphas, recall_train, marker='o', label="train",
        drawstyle="steps-post")
ax.plot(ccp_alphas, recall_test, marker='o', label="test",
        drawstyle="steps-post")
ax.legend()
plt.show()
# selecting the model with the highest test recall
index_best_model = np.argmax(recall_test)
best_model = clfs[index_best_model]
print(best_model)
DecisionTreeClassifier(random_state=1)
from sklearn.metrics import recall_score
# Recall score on training data
y_train_pred = best_model.predict(X_train)
best_recall_train = recall_score(y_train, y_train_pred)
# Recall score on test data
y_test_pred = best_model.predict(X_test)
best_recall_test = recall_score(y_test, y_test_pred)
# Separate names keep the per-alpha recall lists from being overwritten
print(f"Recall on training set: {best_recall_train}")
print(f"Recall on test set: {best_recall_test}")
Recall on training set: 0.9710630156642354
Recall on test set: 0.7978421351504826
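The selections above pick the tree by its score on the test set, which means the test set no longer gives an unbiased estimate of performance. One possible alternative, sketched here under stated assumptions, is to choose alpha by cross-validated recall on the training data only and to touch the test set once at the end. The synthetic data, the candidate_alphas grid and the names cv_recall, best_alpha and final_tree are hypothetical and are not taken from the notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the hotel booking data (assumption, for illustration only)
X_demo, y_demo = make_classification(n_samples=600, n_features=8, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=1)

# Hypothetical grid of pruning strengths; in practice these could come from the pruning path
candidate_alphas = [0.0, 0.001, 0.005, 0.01, 0.02]

# Mean 5-fold cross-validated recall for each alpha, computed on training data only
cv_recall = []
for alpha in candidate_alphas:
    tree = DecisionTreeClassifier(random_state=1, ccp_alpha=alpha)
    scores = cross_val_score(tree, X_tr, y_tr, cv=5, scoring="recall")
    cv_recall.append(scores.mean())

# Refit on the full training set with the selected alpha, then evaluate once on the test set
best_alpha = candidate_alphas[int(np.argmax(cv_recall))]
final_tree = DecisionTreeClassifier(random_state=1, ccp_alpha=best_alpha).fit(X_tr, y_tr)
test_recall = recall_score(y_te, final_tree.predict(X_te))
print(f"Selected ccp_alpha: {best_alpha}, test recall: {test_recall:.3f}")
```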
# plotting the relative importance of each feature used by this model
importances = best_model.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12,12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='violet', align='center')
plt.yticks(range(len(indices)), [the_features[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
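The bar chart can be complemented by a sorted table, which makes the exact importance values easier to read. The following is a minimal sketch on synthetic data; feature_names and importance_table are hypothetical names, and in the notebook the index would presumably be the_features.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training data (assumption, for illustration only)
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=1)
feature_names = [f"feature_{i}" for i in range(X_demo.shape[1])]  # hypothetical names

tree = DecisionTreeClassifier(random_state=1).fit(X_demo, y_demo)

# Impurity-based importances, labelled by feature and sorted from most to least important
importance_table = pd.Series(
    tree.feature_importances_, index=feature_names
).sort_values(ascending=False)
print(importance_table)
```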
comparison_frame = pd.DataFrame({'Model':['Initial decision tree model','Decision tree with restricted maximum depth','Decision tree with hyperparameter tuning',
                                          'Decision tree with post-pruning'], 'Train_Recall':[.981,.732,.732,.979], 'Test_Recall':[.792,.739,.739,.794]})
comparison_frame
| | Model | Train_Recall | Test_Recall |
|---|---|---|---|
| 0 | Initial decision tree model | 0.981 | 0.792 |
| 1 | Decision tree with restricted maximum depth | 0.732 | 0.739 |
| 2 | Decision tree with hyperparameter tuning | 0.732 | 0.739 |
| 3 | Decision tree with post-pruning | 0.979 | 0.794 |
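A quick way to read the comparison is to look at the gap between training and test recall, since a large gap suggests overfitting. The sketch below rebuilds the table from the values reported above and adds that gap; the names comparison and Recall_Gap are illustrative additions, not part of the original notebook.

```python
import pandas as pd

# Recall values copied from the comparison table above
comparison = pd.DataFrame({
    "Model": [
        "Initial decision tree model",
        "Decision tree with restricted maximum depth",
        "Decision tree with hyperparameter tuning",
        "Decision tree with post-pruning",
    ],
    "Train_Recall": [0.981, 0.732, 0.732, 0.979],
    "Test_Recall": [0.792, 0.739, 0.739, 0.794],
})

# Train minus test recall: values near zero indicate similar behaviour on both sets
comparison["Recall_Gap"] = (comparison["Train_Recall"] - comparison["Test_Recall"]).round(3)
print(comparison.sort_values("Recall_Gap"))
```

On these numbers the initial and post-pruned trees score highest on test recall but show a gap of roughly 0.19, while the depth-restricted and tuned trees give up about 0.05 of test recall in exchange for almost no gap.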
Insight
Actionable Insights
⚛ Cancellations are Highly Influenced by Lead Time
⚛ Market Segment Plays a Crucial Role
⚛ Price Sensitivity
⚛ Seasonality Affects Booking Behavior
⚛ Booking Trends with Lead Time
⚛ Impact of Market Segment on Booking Status
⚛ Effect of Room Price on Cancellations
⚛ Special Requests and Booking Completion
⚛ Influence of Arrival Month on Booking Behavior
Recommendations
⚛ Early Booking Discounts
⚛ Market Segment-Specific Campaigns
⚛ Dynamic Pricing Strategies
⚛ Enhanced Booking Experience
⚛ Focus on Direct Bookings
⚛ Cancellation Policies
⚛ Predictive Booking System